ELEMENTS OF
INFORMATION THEORY


Second Edition


THOMAS M. COVER
JOY A. THOMAS


WILEY-INTERSCIENCE
A JOHN WILEY & SONS, INC., PUBLICATION





Copyright © 2006 by John Wiley & Sons, Inc. All rights reserved. 

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. 
Published simultaneously in Canada. 


No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any 
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, 
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without 
either the prior written permission of the Publisher, or authorization through payment of the 
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, 
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. 
Requests to the Publisher for permission should be addressed to the Permissions Department, 
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, 
or online at http://www.wiley.com/go/permission. 


Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts 
in preparing this book, they make no representations or warranties with respect to the accuracy or 
completeness of the contents of this book and specifically disclaim any implied warranties of 
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales 
representatives or written sales materials. The advice and strategies contained herein may not be 
suitable for your situation. You should consult with a professional where appropriate. Neither the 
publisher nor author shall be liable for any loss of profit or any other commercial damages, including 
but not limited to special, incidental, consequential, or other damages. 


For general information on our other products and services or for technical support, please contact our 
Customer Care Department within the United States at (800) 762-2974, outside the United States at 
(317) 572-3993 or fax (317) 572-4002. 


Wiley also publishes its books in a variety of electronic formats. Some content that appears in print 
may not be available in electronic formats. For more information about Wiley products, visit our web 
site at www.wiley.com. 

Library of Congress Cataloging-in-Publication Data: 

Cover, T. M., 1938- 

Elements of information theory/by Thomas M. Cover, Joy A. Thomas-2nd ed. 
p. cm. 

“A Wiley-Interscience publication.” 

Includes bibliographical references and index. 

ISBN-13 978-0-471-24195-9 
ISBN-10 0-471-24195-4 

1. Information theory. I. Thomas, Joy A. II. Title. 

Q360.C68 2005 
003'.54-dc22 

2005047799 


Printed in the United States of America. 


10 9 8 7 6 5 4 3 2 1 


CONTENTS 


Preface to the Second Edition 
Preface to the First Edition 
Acknowledgments for the Second Edition 
Acknowledgments for the First Edition 

1 Introduction and Preview 

1.1 Preview of the Book 

2 Entropy, Relative Entropy, and Mutual Information 

2.1 Entropy 
2.2 Joint Entropy and Conditional Entropy 
2.3 Relative Entropy and Mutual Information 
2.4 Relationship Between Entropy and Mutual Information 
2.5 Chain Rules for Entropy, Relative Entropy, and Mutual Information 
2.6 Jensen’s Inequality and Its Consequences 
2.7 Log Sum Inequality and Its Applications 
2.8 Data-Processing Inequality 
2.9 Sufficient Statistics 
2.10 Fano’s Inequality 
Summary 
Problems 
Historical Notes 

3 Asymptotic Equipartition Property 

3.1 Asymptotic Equipartition Property Theorem 
3.2 Consequences of the AEP: Data Compression 
3.3 High-Probability Sets and the Typical Set 
Summary 
Problems 
Historical Notes 

4 Entropy Rates of a Stochastic Process 

4.1 Markov Chains 
4.2 Entropy Rate 
4.3 Example: Entropy Rate of a Random Walk on a Weighted Graph 
4.4 Second Law of Thermodynamics 
4.5 Functions of Markov Chains 
Summary 
Problems 
Historical Notes 

5 Data Compression 

5.1 Examples of Codes 
5.2 Kraft Inequality 
5.3 Optimal Codes 
5.4 Bounds on the Optimal Code Length 
5.5 Kraft Inequality for Uniquely Decodable Codes 
5.6 Huffman Codes 
5.7 Some Comments on Huffman Codes 
5.8 Optimality of Huffman Codes 
5.9 Shannon-Fano-Elias Coding 
5.10 Competitive Optimality of the Shannon Code 
5.11 Generation of Discrete Distributions from Fair Coins 
Summary 
Problems 
Historical Notes 

6 Gambling and Data Compression 

6.1 The Horse Race 
6.2 Gambling and Side Information 
6.3 Dependent Horse Races and Entropy Rate 
6.4 The Entropy of English 
6.5 Data Compression and Gambling 
6.6 Gambling Estimate of the Entropy of English 
Summary 
Problems 
Historical Notes 

7 Channel Capacity 

7.1 Examples of Channel Capacity 
7.1.1 Noiseless Binary Channel 
7.1.2 Noisy Channel with Nonoverlapping Outputs 
7.1.3 Noisy Typewriter 
7.1.4 Binary Symmetric Channel 
7.1.5 Binary Erasure Channel 
7.2 Symmetric Channels 
7.3 Properties of Channel Capacity 
7.4 Preview of the Channel Coding Theorem 
7.5 Definitions 
7.6 Jointly Typical Sequences 
7.7 Channel Coding Theorem 
7.8 Zero-Error Codes 
7.9 Fano’s Inequality and the Converse to the Coding Theorem 
7.10 Equality in the Converse to the Channel Coding Theorem 
7.11 Hamming Codes 
7.12 Feedback Capacity 
7.13 Source-Channel Separation Theorem 
Summary 
Problems 
Historical Notes 

8 Differential Entropy 

8.1 Definitions 
8.2 AEP for Continuous Random Variables 
8.3 Relation of Differential Entropy to Discrete Entropy 
8.4 Joint and Conditional Differential Entropy 
8.5 Relative Entropy and Mutual Information 
8.6 Properties of Differential Entropy, Relative Entropy, and Mutual Information 
Summary 
Problems 
Historical Notes 

9 Gaussian Channel 

9.1 Gaussian Channel: Definitions 
9.2 Converse to the Coding Theorem for Gaussian Channels 
9.3 Bandlimited Channels 
9.4 Parallel Gaussian Channels 
9.5 Channels with Colored Gaussian Noise 
9.6 Gaussian Channels with Feedback 
Summary 
Problems 
Historical Notes 

10 Rate Distortion Theory 

10.1 Quantization 
10.2 Definitions 
10.3 Calculation of the Rate Distortion Function 
10.3.1 Binary Source 
10.3.2 Gaussian Source 
10.3.3 Simultaneous Description of Independent Gaussian Random Variables 
10.4 Converse to the Rate Distortion Theorem 
10.5 Achievability of the Rate Distortion Function 
10.6 Strongly Typical Sequences and Rate Distortion 
10.7 Characterization of the Rate Distortion Function 
10.8 Computation of Channel Capacity and the Rate Distortion Function 
Summary 
Problems 
Historical Notes 

11 Information Theory and Statistics 

11.1 Method of Types 
11.2 Law of Large Numbers 
11.3 Universal Source Coding 
11.4 Large Deviation Theory 
11.5 Examples of Sanov’s Theorem 
11.6 Conditional Limit Theorem 
11.7 Hypothesis Testing 
11.8 Chernoff-Stein Lemma 
11.9 Chernoff Information 
11.10 Fisher Information and the Cramér-Rao Inequality 
Summary 
Problems 
Historical Notes 

12 Maximum Entropy 

12.1 Maximum Entropy Distributions 
12.2 Examples 
12.3 Anomalous Maximum Entropy Problem 
12.4 Spectrum Estimation 
12.5 Entropy Rates of a Gaussian Process 
12.6 Burg’s Maximum Entropy Theorem 
Summary 
Problems 
Historical Notes 

13 Universal Source Coding 

13.1 Universal Codes and Channel Capacity 
13.2 Universal Coding for Binary Sequences 
13.3 Arithmetic Coding 
13.4 Lempel-Ziv Coding 
13.4.1 Sliding Window Lempel-Ziv Algorithm 
13.4.2 Tree-Structured Lempel-Ziv Algorithms 
13.5 Optimality of Lempel-Ziv Algorithms 
13.5.1 Sliding Window Lempel-Ziv Algorithms 
13.5.2 Optimality of Tree-Structured Lempel-Ziv Compression 
Summary 
Problems 
Historical Notes 

14 Kolmogorov Complexity 

14.1 Models of Computation 
14.2 Kolmogorov Complexity: Definitions and Examples 
14.3 Kolmogorov Complexity and Entropy 
14.4 Kolmogorov Complexity of Integers 
14.5 Algorithmically Random and Incompressible Sequences 
14.6 Universal Probability 
14.7 Kolmogorov complexity 
14.8 Ω 
14.9 Universal Gambling 
14.10 Occam’s Razor 
14.11 Kolmogorov Complexity and Universal Probability 
14.12 Kolmogorov Sufficient Statistic 
14.13 Minimum Description Length Principle 
Summary 
Problems 
Historical Notes 

15 Network Information Theory 

15.1 Gaussian Multiple-User Channels 
15.1.1 Single-User Gaussian Channel 
15.1.2 Gaussian Multiple-Access Channel with m Users 
15.1.3 Gaussian Broadcast Channel 
15.1.4 Gaussian Relay Channel 
15.1.5 Gaussian Interference Channel 
15.1.6 Gaussian Two-Way Channel 
15.2 Jointly Typical Sequences 
15.3 Multiple-Access Channel 
15.3.1 Achievability of the Capacity Region for the Multiple-Access Channel 
15.3.2 Comments on the Capacity Region for the Multiple-Access Channel 
15.3.3 Convexity of the Capacity Region of the Multiple-Access Channel 
15.3.4 Converse for the Multiple-Access Channel 
15.3.5 m-User Multiple-Access Channels 
15.3.6 Gaussian Multiple-Access Channels 
15.4 Encoding of Correlated Sources 
15.4.1 Achievability of the Slepian-Wolf Theorem 
15.4.2 Converse for the Slepian-Wolf Theorem 
15.4.3 Slepian-Wolf Theorem for Many Sources 
15.4.4 Interpretation of Slepian-Wolf Coding 
15.5 Duality Between Slepian-Wolf Encoding and Multiple-Access Channels 
15.6 Broadcast Channel 
15.6.1 Definitions for a Broadcast Channel 
15.6.2 Degraded Broadcast Channels 
15.6.3 Capacity Region for the Degraded Broadcast Channel 
15.7 Relay Channel 
15.8 Source Coding with Side Information 
15.9 Rate Distortion with Side Information 
15.10 General Multiterminal Networks 
Summary 
Problems 
Historical Notes 

16 Information Theory and Portfolio Theory 

16.1 The Stock Market: Some Definitions 
16.2 Kuhn-Tucker Characterization of the Log-Optimal Portfolio 
16.3 Asymptotic Optimality of the Log-Optimal Portfolio 
16.4 Side Information and the Growth Rate 
16.5 Investment in Stationary Markets 
16.6 Competitive Optimality of the Log-Optimal Portfolio 
16.7 Universal Portfolios 
16.7.1 Finite-Horizon Universal Portfolios 
16.7.2 Horizon-Free Universal Portfolios 
16.8 Shannon-McMillan-Breiman Theorem (General AEP) 
Summary 
Problems 
Historical Notes 

17 Inequalities in Information Theory 

17.1 Basic Inequalities of Information Theory 
17.2 Differential Entropy 
17.3 Bounds on Entropy and Relative Entropy 
17.4 Inequalities for Types 
17.5 Combinatorial Bounds on Entropy 
17.6 Entropy Rates of Subsets 
17.7 Entropy and Fisher Information 
17.8 Entropy Power Inequality and Brunn-Minkowski Inequality 
17.9 Inequalities for Determinants 
17.10 Inequalities for Ratios of Determinants 
Summary 
Problems 
Historical Notes 

Bibliography 

List of Symbols 

Index 



PREFACE TO THE 
SECOND EDITION 


In the years since the publication of the first edition, there were many 
aspects of the book that we wished to improve, to rearrange, or to expand, 
but the constraints of reprinting would not allow us to make those changes 
between printings. In the new edition, we now get a chance to make some 
of these changes, to add problems, and to discuss some topics that we had 
omitted from the first edition. 

The key changes include a reorganization of the chapters to make 
the book easier to teach, and the addition of more than two hundred 
new problems. We have added material on universal portfolios, universal 
source coding, Gaussian feedback capacity, and network information theory, 
and have developed the duality of data compression and channel capacity. A 
new chapter has been added and many proofs have been simplified. We 
have also updated the references and historical notes. 

The material in this book can be taught in a two-quarter sequence. The 
first quarter might cover Chapters 1 to 9, which include the asymptotic 
equipartition property, data compression, and channel capacity, culminating 
in the capacity of the Gaussian channel. The second quarter could 
cover the remaining chapters, including rate distortion, the method of 
types, Kolmogorov complexity, network information theory, universal 
source coding, and portfolio theory. If only one semester is available, we 
would add rate distortion and a single lecture each on Kolmogorov complexity 
and network information theory to the first semester. A web site, 
http://www.elementsofinformationtheory.com, provides links to additional 
material and solutions to selected problems. 

In the years since the first edition of the book, information theory 
celebrated its 50th birthday (the 50th anniversary of Shannon’s original 
paper that started the field), and ideas from information theory have been 
applied to many problems of science and technology, including bioinformatics, 
web search, wireless communication, video compression, and 
others. The list of applications is endless, but it is the elegance of the 
fundamental mathematics that is still the key attraction of this area. We 
hope that this book will give some insight into why we believe that this 
is one of the most interesting areas at the intersection of mathematics, 
physics, statistics, and engineering. 


Tom Cover 
Joy Thomas 


Palo Alto, California 
January 2006 


PREFACE TO THE 
FIRST EDITION 


This is intended to be a simple and accessible book on information theory. 
As Einstein said, “Everything should be made as simple as possible, but no 
simpler.” Although we have not verified the quote (first found in a fortune 
cookie), this point of view drives our development throughout the book. 
There are a few key ideas and techniques that, when mastered, make the 
subject appear simple and provide great intuition on new questions. 

This book has arisen from over ten years of lectures in a two-quarter 
sequence of a senior and first-year graduate-level course in information 
theory, and is intended as an introduction to information theory for students 
of communication theory, computer science, and statistics. 

There are two points to be made about the simplicities inherent in information 
theory. First, certain quantities like entropy and mutual information 
arise as the answers to fundamental questions. For example, entropy is 
the minimum descriptive complexity of a random variable, and mutual 
information is the communication rate in the presence of noise. Also, 
as we shall point out, mutual information corresponds to the increase in 
the doubling rate of wealth given side information. Second, the answers 
to information theoretic questions have a natural algebraic structure. For 
example, there is a chain rule for entropies, and entropy and mutual information 
are related. Thus the answers to problems in data compression 
and communication admit extensive interpretation. We all know the feeling 
that follows when one investigates a problem, goes through a large 
amount of algebra, and finally investigates the answer to find that the 
entire problem is illuminated not by the analysis but by the inspection of 
the answer. Perhaps the outstanding examples of this in physics are Newton’s 
laws and Schrödinger’s wave equation. Who could have foreseen the 
awesome philosophical interpretations of Schrödinger’s wave equation? 

In the text we often investigate properties of the answer before we look 
at the question. For example, in Chapter 2, we define entropy, relative 
entropy, and mutual information and study the relationships and a few 
interpretations of them, showing how the answers fit together in various 
ways. Along the way we speculate on the meaning of the second law of 
thermodynamics. Does entropy always increase? The answer is yes and 
no. This is the sort of result that should please experts in the area but 
might be overlooked as standard by the novice. 

In fact, that brings up a point that often occurs in teaching. It is fun 
to find new proofs or slightly new results that no one else knows. When 
one presents these ideas along with the established material in class, the 
response is “sure, sure, sure.” But the excitement of teaching the material 
is greatly enhanced. Thus we have derived great pleasure from investigating 
a number of new ideas in this textbook. 

Examples of some of the new material in this text include the chapter 
on the relationship of information theory to gambling, the work on the universality 
of the second law of thermodynamics in the context of Markov 
chains, the joint typicality proofs of the channel capacity theorem, the 
competitive optimality of Huffman codes, and the proof of Burg’s theorem 
on maximum entropy spectral density estimation. Also, the chapter on 
Kolmogorov complexity has no counterpart in other information theory 
texts. We have also taken delight in relating Fisher information, mutual 
information, the central limit theorem, and the Brunn-Minkowski and 
entropy power inequalities. To our surprise, many of the classical results 
on determinant inequalities are most easily proved using information theoretic 
inequalities. 

Even though the field of information theory has grown considerably 
since Shannon’s original paper, we have strived to emphasize its coherence. 
While it is clear that Shannon was motivated by problems in communication 
theory when he developed information theory, we treat information 
theory as a field of its own with applications to communication theory 
and statistics. We were drawn to the field of information theory from 
backgrounds in communication theory, probability theory, and statistics, 
because of the apparent impossibility of capturing the intangible concept 
of information. 

Since most of the results in the book are given as theorems and proofs, 
we expect the elegance of the results to speak for itself. In many 
cases we actually describe the properties of the solutions before the problems. 
Again, the properties are interesting in themselves and provide a 
natural rhythm for the proofs that follow. 

One innovation in the presentation is our use of long chains of inequalities 
with no intervening text followed immediately by the explanations. 
By the time the reader comes to many of these proofs, we expect that he 
or she will be able to follow most of these steps without any explanation 
and will be able to pick out the needed explanations. These chains of 
inequalities serve as pop quizzes in which the reader can be reassured 
of having the knowledge needed to prove some important theorems. The 
natural flow of these proofs is so compelling that it prompted us to flout 
one of the cardinal rules of technical writing; and the absence of verbiage 
makes the logical necessity of the ideas evident and the key ideas perspicuous. 
We hope that by the end of the book the reader will share our 
appreciation of the elegance, simplicity, and naturalness of information 
theory. 

Throughout the book we use the method of weakly typical sequences, 
which has its origins in Shannon’s original 1948 work but was formally 
developed in the early 1970s. The key idea here is the asymptotic equipartition 
property, which can be roughly paraphrased as “Almost everything 
is almost equally probable.” 

Chapter 2 includes the basic algebraic relationships of entropy, relative 
entropy, and mutual information. The asymptotic equipartition property 
(AEP) is given central prominence in Chapter 3. This leads us to discuss 
the entropy rates of stochastic processes and data compression in 
Chapters 4 and 5. A gambling sojourn is taken in Chapter 6, where the 
duality of data compression and the growth rate of wealth is developed. 

The sensational success of Kolmogorov complexity as an intellectual 
foundation for information theory is explored in Chapter 14. Here we 
replace the goal of finding a description that is good on the average with 
the goal of finding the universally shortest description. There is indeed 
a universal notion of the descriptive complexity of an object. Here also 
the wonderful number Ω is investigated. This number, which is the binary 
expansion of the probability that a Turing machine will halt, reveals many 
of the secrets of mathematics. 

Channel capacity is established in Chapter 7. The necessary material 
on differential entropy is developed in Chapter 8, laying the groundwork 
for the extension of previous capacity theorems to continuous noise channels. 
The capacity of the fundamental Gaussian channel is investigated in 
Chapter 9. 

The relationship between information theory and statistics, first studied 
by Kullback in the early 1950s and relatively neglected since, is developed 
in Chapter 11. Rate distortion theory requires a little more background 
than its noiseless data compression counterpart, which accounts for its 
placement as late as Chapter 10 in the text. 

The huge subject of network information theory, which is the study 
of the simultaneously achievable flows of information in the presence of 
noise and interference, is developed in Chapter 15. Many new ideas come 
into play in network information theory. The primary new ingredients are 
interference and feedback. Chapter 16 considers the stock market, which is 
the generalization of the gambling processes considered in Chapter 6, and 
shows again the close correspondence of information theory and gambling. 

Chapter 17, on inequalities in information theory, gives us a chance to 
recapitulate the interesting inequalities strewn throughout the book, put 
them in a new framework, and then add some interesting new inequalities 
on the entropy rates of randomly drawn subsets. The beautiful relationship 
of the Brunn-Minkowski inequality for volumes of set sums, the entropy 
power inequality for the effective variance of the sum of independent 
random variables, and the Fisher information inequalities are made explicit 
here. 

We have made an attempt to keep the theory at a consistent level. 
The mathematical level is a reasonably high one, probably the senior or 
first-year graduate level, with a background of at least one good semester 
course in probability and a solid background in mathematics. We have, 
however, been able to avoid the use of measure theory. Measure theory 
comes up only briefly in the proof of the AEP for ergodic processes in 
Chapter 16. This fits in with our belief that the fundamentals of information 
theory are orthogonal to the techniques required to bring them to 
their full generalization. 

The essential vitamins are contained in Chapters 2, 3, 4, 5, 7, 8, 9, 
11, 10, and 15. This subset of chapters can be read without essential 
reference to the others and makes a good core of understanding. In our 
opinion, Chapter 14 on Kolmogorov complexity is also essential for a deep 
understanding of information theory. The rest, ranging from gambling to 
inequalities, is part of the terrain illuminated by this coherent and beautiful 
subject. 

Every course has its first lecture, in which a sneak preview and overview 
of ideas is presented. Chapter 1 plays this role. 


Tom Cover 
Joy Thomas 


Palo Alto, California 
June 1990 


CHAPTER 1 


INTRODUCTION AND PREVIEW 


Information theory answers two fundamental questions in communication 
theory: What is the ultimate data compression (answer: the entropy H), 
and what is the ultimate transmission rate of communication (answer: the 
channel capacity C)? For this reason some consider information theory 
to be a subset of communication theory. We argue that it is much more. 
Indeed, it has fundamental contributions to make in statistical physics 
(thermodynamics), computer science (Kolmogorov complexity or algorithmic 
complexity), statistical inference (Occam’s Razor: “The simplest 
explanation is best”), and to probability and statistics (error exponents for 
optimal hypothesis testing and estimation). 

This “first lecture” chapter goes backward and forward through information 
theory and its naturally related ideas. The full definitions and study 
of the subject begin in Chapter 2. Figure 1.1 illustrates the relationship 
of information theory to other fields. As the figure suggests, information 
theory intersects physics (statistical mechanics), mathematics (probability 
theory), electrical engineering (communication theory), and computer science 
(algorithmic complexity). We now describe the areas of intersection 
in greater detail. 

Electrical Engineering (Communication Theory). In the early 1940s 
it was thought to be impossible to send information at a positive rate 
with negligible probability of error. Shannon surprised the communication 
theory community by proving that the probability of error could be 
made nearly zero for all communication rates below channel capacity. 
The capacity can be computed simply from the noise characteristics of 
the channel. Shannon further argued that random processes such as music 
and speech have an irreducible complexity below which the signal cannot 
be compressed. This he named the entropy, in deference to the parallel 
use of this word in thermodynamics, and argued that if the entropy of the 
source is less than the capacity of the channel, asymptotically error-free 
communication can be achieved. 
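
As a small illustration of the claim that capacity follows directly from the noise characteristics of the channel, the sketch below evaluates the capacity of a binary symmetric channel, for which the standard result is C = 1 - H(p) bits per use when each bit is flipped with probability p. The function names `binary_entropy` and `bsc_capacity` are our own illustrative choices, not notation from the text.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(crossover: float) -> float:
    """Capacity in bits per channel use of a binary symmetric channel
    that flips each transmitted bit with probability `crossover`."""
    return 1.0 - binary_entropy(crossover)


# A noiseless channel carries one bit per use; a channel that flips
# bits half the time carries nothing.
for p in (0.0, 0.11, 0.5):
    print(f"crossover={p:.2f}  capacity={bsc_capacity(p):.3f} bits/use")
```

The only input is the crossover probability, which is exactly the "noise characteristic" of this channel.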

Information theory today represents the extreme points of the set of 
all possible communication schemes, as shown in the fanciful Figure 1.2. 
The data compression minimum I(X; X) lies at one extreme of the set of 
communication ideas. All data compression schemes require description 
rates at least equal to this minimum. At the other extreme is the data 
transmission maximum I(X; Y), known as the channel capacity. Thus, 
all modulation schemes and data compression schemes lie between these 
limits. 
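
To see why the compression limit can be written as I(X; X), it may help to compute mutual information from a joint distribution and check that the mutual information of a variable with itself equals its entropy. The sketch below assumes a joint probability mass function given as a nested list `joint[x][y]`; the helper names and the example distribution are illustrative assumptions.

```python
import math


def entropy(pmf):
    """Entropy in bits of a probability mass function given as a list."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def mutual_information(joint):
    """I(X; Y) in bits for a joint pmf given as rows joint[x][y]."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total


pmf = [0.5, 0.25, 0.25]
# Joint distribution of the pair (X, X): all mass sits on the diagonal.
joint_xx = [[p if i == j else 0.0 for j in range(len(pmf))]
            for i, p in enumerate(pmf)]
print(entropy(pmf), mutual_information(joint_xx))
```

Both printed values agree, and an independent pair (any joint that factors into its marginals) gives zero.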

Information theory also suggests means of achieving these ultimate 
limits of communication. However, these theoretically optimal communication 
schemes, beautiful as they are, may turn out to be computationally 
impractical. It is only because of the computational feasibility of simple 
modulation and demodulation schemes that we use them rather than 
the random coding and nearest-neighbor decoding rule suggested by Shannon’s 
proof of the channel capacity theorem. Progress in integrated circuits 
and code design has enabled us to reap some of the gains suggested by 
Shannon’s theory. Computational practicality was finally achieved by the 
advent of turbo codes. A good example of an application of the ideas of 
information theory is the use of error-correcting codes on compact discs 
and DVDs. 

Recent work on the communication aspects of information theory has 
concentrated on network information theory: the theory of the simultaneous 
rates of communication from many senders to many receivers in the 
presence of interference and noise. Some of the trade-offs of rates between 
senders and receivers are unexpected, and all have a certain mathematical 
simplicity. A unifying theory, however, remains to be found. 

Computer Science (Kolmogorov Complexity). Kolmogorov, 
Chaitin, and Solomonoff put forth the idea that the complexity of a string 
of data can be defined by the length of the shortest binary computer 
program for computing the string. Thus, the complexity is the minimal 
description length. This definition of complexity turns out to be universal, 
that is, computer independent, and is of fundamental importance. Thus, 
Kolmogorov complexity lays the foundation for the theory of descriptive 
complexity. Gratifyingly, the Kolmogorov complexity K is approximately 
equal to the Shannon entropy H if the sequence is drawn at random from 
a distribution that has entropy H. So the tie-in between information theory 
and Kolmogorov complexity is perfect. Indeed, we consider Kolmogorov 
complexity to be more fundamental than Shannon entropy. It is the ultimate 
data compression and leads to a logically consistent procedure for 
inference. 
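
Kolmogorov complexity itself is not computable, but a rough feel for the relationship with entropy can be had by using an off-the-shelf compressor as a stand-in. The sketch below is only that: it assumes `zlib` as the compressor, a Bernoulli(0.1) bit source, and a fixed seed, none of which come from the text. The compressed length is merely an upper-bound proxy for description length, so we expect it to land somewhat above the entropy nH(p) and well below the raw length n.

```python
import math
import random
import zlib


def binary_entropy(p):
    """Binary entropy H(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bernoulli_bytes(n_bits, p, seed=0):
    """Draw n_bits i.i.d. Bernoulli(p) bits and pack them into bytes."""
    rng = random.Random(seed)
    packed = bytearray()
    for _ in range(n_bits // 8):
        byte = 0
        for _ in range(8):
            byte = (byte << 1) | int(rng.random() < p)
        packed.append(byte)
    return bytes(packed)


n_bits, p = 80_000, 0.1
data = bernoulli_bytes(n_bits, p)
entropy_bound = n_bits * binary_entropy(p)      # n * H(p), in bits
compressed = 8 * len(zlib.compress(data, 9))    # proxy description length

print(f"raw length       : {n_bits} bits")
print(f"entropy n*H(p)   : {entropy_bound:.0f} bits")
print(f"zlib description : {compressed} bits")
```

A general-purpose compressor is far from the shortest program, so the gap between the last two lines reflects the compressor, not the theory.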

There is a pleasing complementary relationship between algorithmic 
complexity and computational complexity. One can think about computational 
complexity (time complexity) and Kolmogorov complexity (program 
length or descriptive complexity) as two axes corresponding to 
program running time and program length. Kolmogorov complexity focuses 
on minimizing along the second axis, and computational complexity 
focuses on minimizing along the first axis. Little work has been done on 
the simultaneous minimization of the two. 

Physics (Thermodynamics). Statistical mechanics is the birthplace of 
entropy and the second law of thermodynamics. Entropy always increases. 
Among other things, the second law allows one to dismiss any claims to 
perpetual motion machines. We discuss the second law briefly in Chapter 4. 

Mathematics (Probability Theory and Statistics). The fundamental 
quantities of information theory (entropy, relative entropy, and mutual 
information) are defined as functionals of probability distributions. In 
turn, they characterize the behavior of long sequences of random variables 
and allow us to estimate the probabilities of rare events (large deviation 
theory) and to find the best error exponent in hypothesis tests. 

Philosophy of Science (Occam’s Razor). William of Occam said 
“Causes shall not be multiplied beyond necessity,” or to paraphrase it, 
“The simplest explanation is best.” Solomonoff and Chaitin argued persuasively 
that one gets a universally good prediction procedure if one takes 
a weighted combination of all programs that explain the data and observes 
what they print next. Moreover, this inference will work in many problems 
not handled by statistics. For example, this procedure will eventually predict 
the subsequent digits of π. When this procedure is applied to coin flips 
that come up heads with probability 0.7, this too will be inferred. When 
applied to the stock market, the procedure should essentially find all the 
“laws” of the stock market and extrapolate them optimally. In principle, 
such a procedure would have found Newton’s laws of physics. Of course, 
such inference is highly impractical, because weeding out all computer 
programs that fail to generate existing data will take impossibly long. We 
would predict what happens tomorrow a hundred years from now. 

Economics (Investment). Repeated investment in a stationary stock 
market results in an exponential growth of wealth. The growth rate of 
the wealth is a dual of the entropy rate of the stock market. The
parallels between the theory of optimal investment in the stock market and
information theory are striking. We develop the theory of investment to 
explore this duality. 

Computation vs. Communication. As we build larger computers 
out of smaller components, we encounter both a computation limit and 
a communication limit. Computation is communication limited and
communication is computation limited. These become intertwined, and thus
all of the developments in communication theory via information theory
should have a direct impact on the theory of computation.


1.1 PREVIEW OF THE BOOK 

The initial questions treated by information theory lay in the areas of 
data compression and transmission. The answers are quantities such as 
entropy and mutual information, which are functions of the probability 
distributions that underlie the process of communication. A few definitions 
will aid the initial discussion. We repeat these definitions in Chapter 2. 

The entropy of a random variable X with a probability mass function 
p(x) is defined by 

    H(X) = -\sum_{x} p(x) \log_2 p(x).    (1.1)

We use logarithms to base 2. The entropy will then be measured in bits. 
The entropy is a measure of the average uncertainty in the random
variable. It is the number of bits on average required to describe the
random variable.


Example 1.1.1 Consider a random variable that has a uniform
distribution over 32 outcomes. To identify an outcome, we need a label
that takes on 32 different values. Thus, 5-bit strings suffice as labels.

The entropy of this random variable is 

    H(X) = -\sum_{i=1}^{32} p(i) \log p(i) = -\sum_{i=1}^{32} \frac{1}{32} \log \frac{1}{32} = \log 32 = 5 \text{ bits},    (1.2)

which agrees with the number of bits needed to describe X. In this case, 
all the outcomes have representations of the same length. 


Now consider an example with nonuniform distribution. 


Example 1.1.2 Suppose that we have a horse race with eight horses 
taking part. Assume that the probabilities of winning for the eight horses
are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). We can calculate the
entropy of the horse race as

    H(X) = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{8} \log \frac{1}{8} - \frac{1}{16} \log \frac{1}{16} - 4 \cdot \frac{1}{64} \log \frac{1}{64} = 2 \text{ bits}.    (1.3)

Suppose that we wish to send a message indicating which horse won
the race. One alternative is to send the index of the winning horse. This 
description requires 3 bits for any of the horses. But the win probabilities 
are not uniform. It therefore makes sense to use shorter descriptions for the 
more probable horses and longer descriptions for the less probable ones, 
so that we achieve a lower average description length. For example, we 
could use the following set of bit strings to represent the eight horses: 0,
10, 110, 1110, 111100, 111101, 111110, 111111. The average description
length in this case is 2 bits, as opposed to 3 bits for the uniform code.
Notice that the average description length in this case is equal to the 
entropy. In Chapter 5 we show that the entropy of a random variable is 
a lower bound on the average number of bits required to represent the 
random variable and also on the average number of questions needed to 
identify the variable in a game of “20 questions.” We also show how to 
construct representations that have an average length within 1 bit of the 
entropy. 
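
The numbers in Examples 1.1.1 and 1.1.2 can be checked with a short
Python sketch. The function and variable names below are ours and are
not part of the text; the probabilities and codewords are the ones
given in the examples.

```python
from fractions import Fraction
from math import log2

# Win probabilities and codewords from Example 1.1.2.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)] + [Fraction(1, 64)] * 4
codewords = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]


def entropy_bits(pmf):
    """H(X) = -sum p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(float(p) * log2(float(p)) for p in pmf if p > 0)


# Expected codeword length under the win probabilities (exact arithmetic).
average_length = sum(p * len(word) for p, word in zip(probs, codewords))

print("entropy of the race (bits):", entropy_bits(probs))
print("average description length (bits):", float(average_length))
print("uniform over 32 outcomes (bits):", entropy_bits([Fraction(1, 32)] * 32))
```

For this particular distribution the average codeword length and the
entropy coincide because every probability is a power of 1/2.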

The concept of entropy in information theory is related to the concept of 
entropy in statistical mechanics. If we draw a sequence of n independent 
and identically distributed (i.i.d.) random variables, we will show that the 
probability of a “typical” sequence is about 2^{-nH(X)} and that there are
about 2^{nH(X)} such typical sequences. This property [known as the
asymptotic equipartition property (AEP)] is the basis of many of the proofs in
information theory. We later present other problems for which entropy 
arises as a natural answer (e.g., the number of fair coin flips needed to 
generate a random variable). 

The notion of descriptive complexity of a random variable can be 
extended to define the descriptive complexity of a single string. The
Kolmogorov complexity of a binary string is defined as the length of the
shortest computer program that prints out the string. It will turn out that 
if the string is indeed random, the Kolmogorov complexity is close to 
the entropy. Kolmogorov complexity is a natural framework in which 
to consider problems of statistical inference and modeling and leads to 
a clearer understanding of Occam’s Razor: “The simplest explanation is
best.” We describe some simple properties of Kolmogorov complexity in
Chapter 14.

Entropy is the uncertainty of a single random variable. We can define 
conditional entropy H(X|Y), which is the entropy of a random variable
conditional on the knowledge of another random variable. The reduction
in uncertainty due to another random variable is called the mutual
information. For two random variables X and Y this reduction is the mutual
information

    I(X; Y) = H(X) - H(X|Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.    (1.4)

The mutual information I(X; Y) is a measure of the dependence between
the two random variables. It is symmetric in X and Y and always
nonnegative and is equal to zero if and only if X and Y are independent.

A communication channel is a system in which the output depends 
probabilistically on its input. It is characterized by a probability transition 
matrix p(y|x) that determines the conditional distribution of the output
given the input. For a communication channel with input X and output
Y, we can define the capacity C by

    C = \max_{p(x)} I(X; Y).    (1.5)


Later we show that the capacity is the maximum rate at which we can send 
information over the channel and recover the information at the output 
with a vanishingly low probability of error. We illustrate this with a few 
examples. 


Example 1.1.3 (Noiseless binary channel) For this channel, the binary
input is reproduced exactly at the output. This channel is illustrated in
Figure 1.3. Here, any transmitted bit is received without error. Hence,
in each transmission, we can send 1 bit reliably to the receiver, and the
capacity is 1 bit. We can also calculate the information capacity C =
max I(X; Y) = 1 bit.

FIGURE 1.3. Noiseless binary channel. C = 1 bit.

Example 1.1.4 (Noisy four-symbol channel) Consider the channel
shown in Figure 1.4. In this channel, each input letter is received either as
the same letter with probability 1/2 or as the next letter with probability
1/2. If we use all four input symbols, inspection of the output would not
reveal with certainty which input symbol was sent. If, on the other hand,
we use only two of the inputs (1 and 3, say), we can tell immediately from
the output which input symbol was sent. This channel then acts like the
noiseless channel of Example 1.1.3, and we can send 1 bit per transmission
over this channel with no errors. We can calculate the channel capacity
C = max I(X; Y) in this case, and it is equal to 1 bit per transmission,
in agreement with the analysis above.

FIGURE 1.4. Noisy channel.
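
As a rough numerical check of the noisy four-symbol channel, the
sketch below evaluates I(X; Y) for two input distributions. It assumes
that “the next letter” wraps around, so that the last letter can be
received as the first; the helper name mutual_information is ours.
The sketch evaluates the mutual information at two inputs only and
does not carry out the maximization over p(x).

```python
from math import log2


def mutual_information(input_dist, channel):
    """I(X;Y) in bits for input pmf p(x) and matrix channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    output_dist = [
        sum(input_dist[x] * channel[x][y] for x in range(len(input_dist)))
        for y in range(n_out)
    ]
    total = 0.0
    for x, px in enumerate(input_dist):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0:
                total += joint * log2(channel[x][y] / output_dist[y])
    return total


# Each letter arrives as itself or as the next letter, each with probability 1/2.
# Assumption: "next" wraps around from the fourth letter to the first.
channel = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5],
]

all_inputs = mutual_information([0.25, 0.25, 0.25, 0.25], channel)
two_inputs = mutual_information([0.5, 0.0, 0.5, 0.0], channel)
print(all_inputs, two_inputs)
```

Both input distributions give 1 bit per transmission, but only the
two-input scheme lets the receiver identify the input without error
from a single output.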

In general, communication channels do not have the simple structure of 
this example, so we cannot always identify a subset of the inputs to send 
information without error. But if we consider a sequence of transmissions, 
all channels look like this example and we can then identify a subset of the 
input sequences (the codewords) that can be used to transmit information 
over the channel in such a way that the sets of possible output sequences 
associated with each of the codewords are approximately disjoint. We can 
then look at the output sequence and identify the input sequence with a 
vanishingly low probability of error. 

Example 1.1.5 (Binary symmetric channel) This is the basic example 
of a noisy communication system. The channel is illustrated in Figure 1.5.

FIGURE 1.5. Binary symmetric channel.

The channel has a binary input, and its output is equal to the input with
probability 1 - p. With probability p, on the other hand, a 0 is received
as a 1, and vice versa. In this case, the capacity of the channel can be
calculated to be C = 1 + p log p + (1 - p) log(1 - p) bits per transmission.
However, it is no longer obvious how one can achieve this capacity. If we 
use the channel many times, however, the channel begins to look like the 
noisy four-symbol channel of Example 1.1.4, and we can send informa¬ 
tion at a rate C bits per transmission with an arbitrarily low probability 
of error. 
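
The capacity formula of Example 1.1.5 is easy to evaluate. The
following is a minimal sketch; the function names are ours, and the
crossover probabilities are illustrative values, not taken from the
text.

```python
from math import log2


def binary_entropy(p):
    """-p log2 p - (1 - p) log2 (1 - p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def bsc_capacity(p):
    """C = 1 + p log p + (1 - p) log(1 - p), in bits per transmission."""
    return 1.0 - binary_entropy(p)


for crossover in (0.0, 0.11, 0.5):
    print(crossover, bsc_capacity(crossover))
```

A crossover probability of 0 recovers the noiseless channel (1 bit),
a crossover probability of 1/2 makes the output independent of the
input (0 bits), and a crossover probability near 0.11 gives roughly
half a bit per transmission.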


The ultimate limit on the rate of communication of information over 
a channel is given by the channel capacity. The channel coding theorem 
shows that this limit can be achieved by using codes with a long block 
length. In practical communication systems, there are limitations on the 
complexity of the codes that we can use, and therefore we may not be 
able to achieve capacity. 

Mutual information turns out to be a special case of a more general 
quantity called relative entropy D(p||q), which is a measure of the
“distance” between two probability mass functions p and q. It is defined
as

    D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.    (1.6)


Although relative entropy is not a true metric, it has some of the properties 
of a metric. In particular, it is always nonnegative and is zero if and only 
if p = q. Relative entropy arises as the exponent in the probability of 
error in a hypothesis test between distributions p and q. Relative entropy 
can be used to define a geometry for probability distributions that allows 
us to interpret many of the results of large deviation theory. 

There are a number of parallels between information theory and the 
theory of investment in a stock market. A stock market is defined by a 
random vector X whose elements are nonnegative numbers equal to the 
ratio of the price of a stock at the end of a day to the price at the beginning 
of the day. For a stock market with distribution F(x), we can define the 
doubling rate W as 


    W = \max_{b:\, b_i \ge 0,\ \sum_i b_i = 1} \int \log b^t x \, dF(x).    (1.7)


The doubling rate is the maximum asymptotic exponent in the growth 
of wealth. The doubling rate has a number of properties that parallel the 
properties of entropy. We explore some of these properties in Chapter 16. 

The quantities H, I, C, D, K, W arise naturally in the following areas:


• Data compression. The entropy H of a random variable is a lower
bound on the average length of the shortest description of the random
variable. We can construct descriptions with average length within 1
bit of the entropy. If we relax the constraint of recovering the source
perfectly, we can then ask what communication rates are required to
describe the source up to distortion D? And what channel capacities
are sufficient to enable the transmission of this source over the
channel and its reconstruction with distortion less than or equal to D?
This is the subject of rate distortion theory.

When we try to formalize the notion of the shortest description 
for nonrandom objects, we are led to the definition of Kolmogorov 
complexity K. Later, we show that Kolmogorov complexity is
universal and satisfies many of the intuitive requirements for the theory
of shortest descriptions. 

• Data transmission. We consider the problem of transmitting
information so that the receiver can decode the message with a small
probability of error. Essentially, we wish to find codewords (sequences
of input symbols to a channel) that are mutually far apart in the
sense that their noisy versions (available at the output of the channel)
are distinguishable. This is equivalent to sphere packing in
high-dimensional space. For any set of codewords it is possible to calculate
the probability that the receiver will make an error (i.e., make an
incorrect decision as to which codeword was sent). However, in most 
cases, this calculation is tedious. 

Using a randomly generated code, Shannon showed that one can 
send information at any rate below the capacity C of the channel 
with an arbitrarily low probability of error. The idea of a randomly 
generated code is very unusual. It provides the basis for a simple 
analysis of a very difficult problem. One of the key ideas in the proof 
is the concept of typical sequences. The capacity C is the logarithm 
of the number of distinguishable input signals. 

• Network information theory. Each of the topics mentioned previously
involves a single source or a single channel. What if one wishes to
compress each of many sources and then put the compressed descriptions
together into a joint reconstruction of the sources? This problem is
solved by the Slepian-Wolf theorem. Or what if one has many senders
sending information independently to a common receiver? What is the
channel capacity of this channel? This is the multiple-access channel
solved by Liao and Ahlswede. Or what if one has one sender and many
receivers and wishes to communicate (perhaps different) information
simultaneously to each of the receivers? This is the broadcast channel.
Finally, what if one has an arbitrary number of senders and receivers in
an environment of interference and noise? What is the capacity region
of achievable rates from the various senders to the receivers? This is
the general network information theory problem. All of the preceding
problems fall into the general area of multiple-user or network
information theory. Although hopes for a comprehensive theory for networks
may be beyond current research techniques, there is still some hope that
all the answers involve only elaborate forms of mutual information and
relative entropy.

• Ergodic theory. The asymptotic equipartition theorem states that most 
sample n-sequences of an ergodic process have probability about 2^{-nH}
and that there are about 2^{nH} such typical sequences.

• Hypothesis testing. The relative entropy D arises as the exponent in 
the probability of error in a hypothesis test between two distributions. 
It is a natural measure of distance between distributions. 

• Statistical mechanics. The entropy H arises in statistical mechanics 
as a measure of uncertainty or disorganization in a physical system. 
Roughly speaking, the entropy is the logarithm of the number of 
ways in which the physical system can be configured. The second law 
of thermodynamics says that the entropy of a closed system cannot 
decrease. Later we provide some interpretations of the second law. 

• Quantum mechanics. Here, von Neumann entropy S = -tr(\rho \ln \rho) =
-\sum_i \lambda_i \ln \lambda_i plays the role of classical
Shannon-Boltzmann entropy H = -\sum_i p_i \log p_i. Quantum mechanical
versions of data compression and channel capacity can then be found.

• Inference. We can use the notion of Kolmogorov complexity K to 
find the shortest description of the data and use that as a model to 
predict what comes next. A model that maximizes the uncertainty or 
entropy yields the maximum entropy approach to inference. 

• Gambling and investment. The optimal exponent in the growth rate 
of wealth is given by the doubling rate W. For a horse race with 
uniform odds, the sum of the doubling rate W and the entropy H is 
constant. The increase in the doubling rate due to side information is 
equal to the mutual information I between a horse race and the side 
information. Similar results hold for investment in the stock market. 

• Probability theory. The asymptotic equipartition property (AEP)
shows that most sequences are typical in that they have a sample
entropy close to H. So attention can be restricted to these
approximately 2^{nH} typical sequences. In large deviation theory, the
probability of a set is approximately 2^{-nD}, where D is the relative
entropy distance between the closest element in the set and the true
distribution.

• Complexity theory. The Kolmogorov complexity K is a measure of
the descriptive complexity of an object. It is related to, but different 
from, computational complexity, which measures the time or space 
required for a computation. 

Information-theoretic quantities such as entropy and relative entropy 
arise again and again as the answers to the fundamental questions in 
communication and statistics. Before studying these questions, we shall 
study some of the properties of the answers. We begin in Chapter 2 with 
the definitions and basic properties of entropy, relative entropy, and mutual 
information. 


CHAPTER 2 

ENTROPY, RELATIVE ENTROPY, 
AND MUTUAL INFORMATION 


In this chapter we introduce most of the basic definitions required for 
subsequent development of the theory. It is irresistible to play with their 
relationships and interpretations, taking faith in their later utility. After 
defining entropy and mutual information, we establish chain rules, the 
nonnegativity of mutual information, the data-processing inequality, and 
illustrate these definitions by examining sufficient statistics and Fano’s 
inequality. 

The concept of information is too broad to be captured completely by 
a single definition. However, for any probability distribution, we define a 
quantity called the entropy, which has many properties that agree with the 
intuitive notion of what a measure of information should be. This notion is 
extended to define mutual information, which is a measure of the amount
of information one random variable contains about another. Entropy then
becomes the self-information of a random variable. Mutual information is
a special case of a more general quantity called relative entropy, which is
a measure of the distance between two probability distributions. All these 
quantities are closely related and share a number of simple properties, 
some of which we derive in this chapter. 

In later chapters we show how these quantities arise as natural answers 
to a number of questions in communication, statistics, complexity, and 
gambling. That will be the ultimate test of the value of these definitions. 


2.1 ENTROPY 


We first introduce the concept of entropy, which is a measure of the 
uncertainty of a random variable. Let X be a discrete random variable
with alphabet \mathcal{X} and probability mass function
p(x) = Pr{X = x}, x \in \mathcal{X}.
We denote the probability mass function by p(x) rather than p_X(x), for
convenience. Thus, p(x) and p(y) refer to two different random variables
and are in fact different probability mass functions, p_X(x) and p_Y(y),
respectively. 

Definition The entropy H(X) of a discrete random variable X is 
defined by 

    H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).    (2.1)

We also write H(p) for the above quantity. The log is to the base 2 
and entropy is expressed in bits. For example, the entropy of a fair coin 
toss is 1 bit. We will use the convention that 0 log 0 = 0, which is easily 
justified by continuity since x log x → 0 as x → 0. Adding terms of zero
probability does not change the entropy. 

If the base of the logarithm is b, we denote the entropy as H_b(X). If
the base of the logarithm is e, the entropy is measured in nats. Unless 
otherwise specified, we will take all logarithms to base 2, and hence all 
the entropies will be measured in bits. Note that entropy is a functional 
of the distribution of X. It does not depend on the actual values taken by 
the random variable X, but only on the probabilities. 
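
Two of the remarks above (the convention 0 log 0 = 0 and the fact that
entropy depends only on the probabilities) can be illustrated with a
small Python sketch. The function name and the example dictionaries
are ours, chosen only for illustration.

```python
from math import log2


def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}.

    Outcomes with probability zero are skipped, which implements
    the convention 0 log 0 = 0.
    """
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


fair_coin = {"heads": 0.5, "tails": 0.5}
relabeled = {0: 0.5, 1000: 0.5}                        # same probabilities, other values
with_zero = {"heads": 0.5, "tails": 0.5, "edge": 0.0}  # an added zero-probability term

print(entropy(fair_coin), entropy(relabeled), entropy(with_zero))
```

All three calls give the same value, the 1 bit of a fair coin toss.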

We denote expectation by E. Thus, if X ~ p(x), the expected value of
the random variable g(X) is written

    E_p g(X) = \sum_{x \in \mathcal{X}} g(x) p(x),    (2.2)

or more simply as Eg(X) when the probability mass function is
understood from the context. We shall take a peculiar interest in the eerily
self-referential expectation of g(X) under p(x) when g(X) = log 1/p(X).

Remark The entropy of X can also be interpreted as the expected value 
of the random variable log 1/p(X), where X is drawn according to probability
mass function p(x). Thus,

    H(X) = E_p \log \frac{1}{p(X)}.    (2.3)

This definition of entropy is related to the definition of entropy in
thermodynamics; some of the connections are explored later. It is possible
to derive the definition of entropy axiomatically by defining certain
properties that the entropy of a random variable must satisfy. This approach
is illustrated in Problem 2.46. We do not use the axiomatic approach to
justify the definition of entropy; instead, we show that it arises as the
answer to a number of natural questions, such as “What is the average 
length of the shortest description of the random variable?” First, we derive 
some immediate consequences of the definition. 

Lemma 2.1.1 H(X) ≥ 0.

Proof: 0 ≤ p(x) ≤ 1 implies that log 1/p(x) ≥ 0. □


Lemma 2.1.2 H_b(X) = (log_b a) H_a(X).

Proof: log_b p = log_b a log_a p. □


The second property of entropy enables us to change the base of the 
logarithm in the definition. Entropy can be changed from one base to 
another by multiplying by the appropriate factor. 
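
Lemma 2.1.2 can be checked numerically. In the sketch below the
probability mass function is an arbitrary illustrative choice (it does
not come from the text), and the function name is ours.

```python
from math import e, log


def entropy(probs, base):
    """H_b(X) = -sum p(x) log_b p(x) for a list of probabilities."""
    return -sum(p * log(p, base) for p in probs if p > 0)


probs = [0.7, 0.2, 0.1]  # illustrative pmf, not taken from the text

bits = entropy(probs, 2)
nats = entropy(probs, e)
hartleys = entropy(probs, 10)

# H_b(X) = (log_b a) H_a(X) with a = 2 and b = e or b = 10.
nats_from_bits = log(2, e) * bits
hartleys_from_bits = log(2, 10) * bits

print(bits, nats, nats_from_bits)
```

Up to floating-point rounding, the converted values agree with the
entropies computed directly in the new base.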


Example 2.1.1 Let

    X = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases}    (2.4)

Then

    H(X) = -p \log p - (1 - p) \log(1 - p) \stackrel{\text{def}}{=} H(p).    (2.5)

In particular, H(X) = 1 bit when p = 1/2. The graph of the function H(p)
is shown in Figure 2.1. The figure illustrates some of the basic properties
of entropy: It is a concave function of the distribution and equals 0 when
p = 0 or 1. This makes sense, because when p = 0 or 1, the variable
is not random and there is no uncertainty. Similarly, the uncertainty is
maximum when p = 1/2, which also corresponds to the maximum value of
the entropy.

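Example 2.1.2 Let

    X = \begin{cases} a & \text{with probability } 1/2, \\ b & \text{with probability } 1/4, \\ c & \text{with probability } 1/8, \\ d & \text{with probability } 1/8. \end{cases}    (2.6)

The entropy of X is

    H(X) = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} - \frac{1}{8} \log \frac{1}{8} - \frac{1}{8} \log \frac{1}{8} = \frac{7}{4} \text{ bits}.    (2.7)

FIGURE 2.1. H(p) vs. p.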


Suppose that we wish to determine the value of X with the minimum 
number of binary questions. An efficient first question is “Is X = a?”
This splits the probability in half. If the answer to the first question is
no, the second question can be “Is X = b?” The third question can be
“Is X = c?” The resulting expected number of binary questions required
is 1.75. This turns out to be the minimum expected number of binary
questions required to determine the value of X. In Chapter 5 we show that
the minimum expected number of binary questions required to determine
X lies between H(X) and H(X) + 1.
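
The agreement between the entropy of Example 2.1.2 and the expected
number of questions can be verified directly. In this sketch the
dictionary questions_needed records how many of the three questions
above are asked before X is known; the names are ours.

```python
from math import log2

# Distribution of Example 2.1.2.
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Questions asked, in order: "Is X = a?", "Is X = b?", "Is X = c?".
# After a "no" to the third question, X = d is known without a fourth.
questions_needed = {"a": 1, "b": 2, "c": 3, "d": 3}

entropy = -sum(p * log2(p) for p in pmf.values())
expected_questions = sum(pmf[x] * questions_needed[x] for x in pmf)

print(entropy, expected_questions)
```

Both quantities equal 1.75, so this questioning scheme meets the lower
bound H(X) exactly.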


2.2 JOINT ENTROPY AND CONDITIONAL ENTROPY 

We defined the entropy of a single random variable in Section 2.1. We 
now extend the definition to a pair of random variables. There is nothing 
really new in this definition because (X, Y) can be considered to be a 
single vector-valued random variable. 

Definition The joint entropy H(X, Y) of a pair of discrete random 
variables (X, Y) with a joint distribution p(x, y) is defined as

    H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),    (2.8)

which can also be expressed as

    H(X, Y) = -E \log p(X, Y).    (2.9)

We also define the conditional entropy of a random variable given 
another as the expected value of the entropies of the conditional
distributions, averaged over the conditioning random variable.

Definition If (X, Y) ~ p(x, y), the conditional entropy H(Y|X) is
defined as

    H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)    (2.10)
           = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x)    (2.11)
           = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)    (2.12)
           = -E \log p(Y|X).    (2.13)

The naturalness of the definition of joint entropy and conditional entropy 
is exhibited by the fact that the entropy of a pair of random variables is 
the entropy of one plus the conditional entropy of the other. This is proved 
in the following theorem. 


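Theorem 2.2.1 (Chain rule)

    H(X, Y) = H(X) + H(Y|X).    (2.14)

Proof

    H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)    (2.15)
            = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) p(y|x)    (2.16)
            = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)    (2.17)
            = -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)    (2.18)
            = H(X) + H(Y|X).    (2.19)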

Equivalently, we can write

    \log p(X, Y) = \log p(X) + \log p(Y|X)    (2.20)

and take the expectation of both sides of the equation to obtain the
theorem. □

Corollary

    H(X, Y|Z) = H(X|Z) + H(Y|X, Z).    (2.21)

Proof: The proof follows along the same lines as the theorem. □

Example 2.2.1 Let (X, Y) have the following joint distribution:

                  X
    Y        1       2       3       4
    1       1/8    1/16    1/32    1/32
    2      1/16     1/8    1/32    1/32
    3      1/16    1/16    1/16    1/16
    4       1/4       0       0       0

The marginal distribution of X is (1/2, 1/4, 1/8, 1/8) and the marginal
distribution of Y is (1/4, 1/4, 1/4, 1/4), and hence H(X) = 7/4 bits and
H(Y) = 2 bits. Also,

    H(X|Y) = \sum_{i=1}^{4} p(Y = i) H(X|Y = i)    (2.22)
           = \frac{1}{4} H(1/2, 1/4, 1/8, 1/8) + \frac{1}{4} H(1/4, 1/2, 1/8, 1/8)
             + \frac{1}{4} H(1/4, 1/4, 1/4, 1/4) + \frac{1}{4} H(1, 0, 0, 0)    (2.23)
           = \frac{1}{4} \cdot \frac{7}{4} + \frac{1}{4} \cdot \frac{7}{4} + \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 0    (2.24)
           = \frac{11}{8} \text{ bits}.    (2.25)

Similarly, H(Y|X) = 13/8 bits and H(X, Y) = 27/8 bits.

Remark Note that H(Y|X) ≠ H(X|Y). However, H(X) - H(X|Y) =
H(Y) - H(Y|X), a property that we exploit later.
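
The entropies of Example 2.2.1 can be recomputed from the joint
distribution. The following Python sketch uses exact fractions for the
marginals and conditionals; the layout joint[y][x] and all names are
our own choices for illustration.

```python
from fractions import Fraction as F
from math import log2

# joint[y][x] = p(x, y); rows are Y = 1..4, columns are X = 1..4.
joint = [
    [F(1, 8), F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4), F(0), F(0), F(0)],
]


def entropy(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)


p_x = [sum(row[x] for row in joint) for x in range(4)]
p_y = [sum(row) for row in joint]

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy(p for row in joint for p in row)

# Conditional entropies as averages of entropies of conditional distributions.
h_x_given_y = sum(
    float(p_y[y]) * entropy(p / p_y[y] for p in joint[y]) for y in range(4)
)
h_y_given_x = sum(
    float(p_x[x]) * entropy(joint[y][x] / p_x[x] for y in range(4)) for x in range(4)
)

print(h_x, h_y, h_xy, h_x_given_y, h_y_given_x)
```

The values satisfy the chain rule in both orders: H(X) + H(Y|X) and
H(Y) + H(X|Y) both equal H(X, Y).
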


2.3 RELATIVE ENTROPY AND MUTUAL INFORMATION

The entropy of a random variable is a measure of the uncertainty of the 
random variable; it is a measure of the amount of information required on 
the average to describe the random variable. In this section we introduce 
two related concepts: relative entropy and mutual information. 

The relative entropy is a measure of the distance between two
distributions. In statistics, it arises as an expected logarithm of the
likelihood ratio. The relative entropy D(p||q) is a measure of the
inefficiency of assuming that the distribution is q when the true
distribution is p. For example, if we knew the true distribution p of the
random variable, we could construct a code with average description
length H(p). If, instead, we used
the code for a distribution q, we would need H(p) + D(p||q) bits on the
average to describe the random variable. 

Definition The relative entropy or Kullback-Leibler distance between
two probability mass functions p(x) and q(x) is defined as

    D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}    (2.26)
              = E_p \log \frac{p(X)}{q(X)}.    (2.27)

In the above definition, we use the convention that 0 log 0/0 = 0 and the
convention (based on continuity arguments) that 0 log 0/q = 0 and
p log p/0 = ∞. Thus, if there is any symbol x \in \mathcal{X} such that
p(x) > 0 and q(x) = 0, then D(p||q) = ∞.

We will soon show that relative entropy is always nonnegative and is 
zero if and only if p = q. However, it is not a true distance between 
distributions since it is not symmetric and does not satisfy the triangle 
inequality. Nonetheless, it is often useful to think of relative entropy as a 
“distance” between distributions. 

We now introduce mutual information, which is a measure of the 
amount of information that one random variable contains about another 
random variable. It is the reduction in the uncertainty of one random 
variable due to the knowledge of the other. 


Definition Consider two random variables X and Y with a joint
probability mass function p(x, y) and marginal probability mass functions
p(x) and p(y). The mutual information I(X; Y) is the relative entropy
between the joint distribution and the product distribution p(x)p(y):

    I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (2.28)
            = D(p(x, y) \| p(x) p(y))    (2.29)
            = E_{p(x, y)} \log \frac{p(X, Y)}{p(X) p(Y)}.    (2.30)

In Chapter 8 we generalize this definition to continuous random
variables, and in (8.54) to general random variables that could be a mixture
of discrete and continuous random variables. 


Example 2.3.1 Let \mathcal{X} = {0, 1} and consider two distributions p
and q on \mathcal{X}. Let p(0) = 1 - r, p(1) = r, and let q(0) = 1 - s,
q(1) = s. Then

    D(p \| q) = (1 - r) \log \frac{1 - r}{1 - s} + r \log \frac{r}{s}    (2.31)

and

    D(q \| p) = (1 - s) \log \frac{1 - s}{1 - r} + s \log \frac{s}{r}.    (2.32)

If r = s, then D(p||q) = D(q||p) = 0. If r = 1/2, s = 1/4, we can calculate

    D(p \| q) = \frac{1}{2} \log \frac{1/2}{3/4} + \frac{1}{2} \log \frac{1/2}{1/4} = 1 - \frac{1}{2} \log 3 = 0.2075 \text{ bit},    (2.33)

whereas

    D(q \| p) = \frac{3}{4} \log \frac{3/4}{1/2} + \frac{1}{4} \log \frac{1/4}{1/2} = \frac{3}{4} \log 3 - 1 = 0.1887 \text{ bit}.    (2.34)

Note that D(p||q) ≠ D(q||p) in general.
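
The asymmetry in Example 2.3.1 can be reproduced with a few lines of
Python. This is a minimal sketch; the function name is ours, and the
conventions for zero probabilities follow the definition of relative
entropy given earlier in this section.

```python
from math import inf, log2


def relative_entropy(p, q):
    """D(p||q) in bits for two pmfs given as equal-length lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue      # convention: 0 log(0/q) = 0
        if qx == 0:
            return inf    # convention: p log(p/0) = infinity
        total += px * log2(px / qx)
    return total


r, s = 0.5, 0.25
p = [1 - r, r]
q = [1 - s, s]

d_pq = relative_entropy(p, q)
d_qp = relative_entropy(q, p)
print(round(d_pq, 4), round(d_qp, 4))
```

The two directions differ, and D(p||p) evaluates to zero, in line with
the properties stated above.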


2.4 RELATIONSHIP BETWEEN ENTROPY AND MUTUAL 
INFORMATION 


We can rewrite the definition of mutual information I(X; Y) as

    I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (2.35)
            = \sum_{x, y} p(x, y) \log \frac{p(x|y)}{p(x)}    (2.36)
            = -\sum_{x, y} p(x, y) \log p(x) + \sum_{x, y} p(x, y) \log p(x|y)    (2.37)
            = -\sum_{x} p(x) \log p(x) - \left( -\sum_{x, y} p(x, y) \log p(x|y) \right)    (2.38)
            = H(X) - H(X|Y).    (2.39)

Thus, the mutual information I(X; Y) is the reduction in the uncertainty
of X due to the knowledge of Y.

By symmetry, it also follows that

    I(X; Y) = H(Y) - H(Y|X).    (2.40)

Thus, X says as much about Y as Y says about X.

Since H(X, Y) = H(X) + H(Y|X), as shown in Section 2.2, we have

    I(X; Y) = H(X) + H(Y) - H(X, Y).    (2.41)

Finally, we note that

    I(X; X) = H(X) - H(X|X) = H(X).    (2.42)


Thus, the mutual information of a random variable with itself is the 
entropy of the random variable. This is the reason that entropy is
sometimes referred to as self-information.

Collecting these results, we have the following theorem. 


Theorem 2.4.1 (Mutual information and entropy)

    I(X; Y) = H(X) - H(X|Y)    (2.43)

    I(X; Y) = H(Y) - H(Y|X)    (2.44)

    I(X; Y) = H(X) + H(Y) - H(X, Y)    (2.45)

    I(X; Y) = I(Y; X)    (2.46)

    I(X; X) = H(X).    (2.47)



FIGURE 2.2. Relationship between entropy and mutual information. 

The relationship between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X),
and I(X; Y) is expressed in a Venn diagram (Figure 2.2). Notice that
the mutual information I(X; Y) corresponds to the intersection of the
information in X with the information in Y.

Example 2.4.1 For the joint distribution of Example 2.2.1, it is easy to
calculate the mutual information I(X; Y) = H(X) - H(X|Y) = H(Y) -
H(Y|X) = 0.375 bit.
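
The value in Example 2.4.1 can be obtained in two ways: from the
definition of mutual information as a relative entropy, and from the
identity I(X; Y) = H(X) + H(Y) - H(X, Y). The sketch below does both
for the joint distribution of Example 2.2.1; the array layout and the
names are our own.

```python
from math import log2

# joint[y][x] = p(x, y) for Example 2.2.1 (rows Y = 1..4, columns X = 1..4).
joint = [
    [1 / 8, 1 / 16, 1 / 32, 1 / 32],
    [1 / 16, 1 / 8, 1 / 32, 1 / 32],
    [1 / 16, 1 / 16, 1 / 16, 1 / 16],
    [1 / 4, 0.0, 0.0, 0.0],
]
p_x = [sum(row[x] for row in joint) for x in range(4)]
p_y = [sum(row) for row in joint]


def entropy(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)


# I(X;Y) = sum p(x,y) log [ p(x,y) / (p(x) p(y)) ].
by_definition = sum(
    p * log2(p / (p_x[x] * p_y[y]))
    for y, row in enumerate(joint)
    for x, p in enumerate(row)
    if p > 0
)

# I(X;Y) = H(X) + H(Y) - H(X,Y).
by_entropies = entropy(p_x) + entropy(p_y) - entropy([p for row in joint for p in row])

print(by_definition, by_entropies)
```

Both routes give 0.375 bit.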

2.5 CHAIN RULES FOR ENTROPY, RELATIVE ENTROPY,
AND MUTUAL INFORMATION

We now show that the entropy of a collection of random variables is the 
sum of the conditional entropies. 

Theorem 2.5.1 (Chain rule for entropy) Let X_1, X_2, ..., X_n be drawn
according to p(x_1, x_2, ..., x_n). Then

    H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1).    (2.48)

Proof: By repeated application of the two-variable expansion rule for 
entropies, we have 


    H(X_1, X_2) = H(X_1) + H(X_2|X_1),    (2.49)

    H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3|X_1)    (2.50)
                     = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1),    (2.51)

        \vdots

    H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_{n-1}, \ldots, X_1)    (2.52)
                             = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1). □    (2.53)

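Alternative Proof: We write $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_{i-1}, \ldots, x_1)$
and evaluate

$$ H(X_1, X_2, \ldots, X_n) $$

$$ = -\sum_{x_1, x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n) \log p(x_1, x_2, \ldots, x_n) \tag{2.54} $$

$$ = -\sum_{x_1, x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n) \log \prod_{i=1}^{n} p(x_i | x_{i-1}, \ldots, x_1) \tag{2.55} $$

$$ = -\sum_{x_1, x_2, \ldots, x_n} \sum_{i=1}^{n} p(x_1, x_2, \ldots, x_n) \log p(x_i | x_{i-1}, \ldots, x_1) \tag{2.56} $$

$$ = -\sum_{i=1}^{n} \sum_{x_1, x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n) \log p(x_i | x_{i-1}, \ldots, x_1) \tag{2.57} $$

$$ = -\sum_{i=1}^{n} \sum_{x_1, x_2, \ldots, x_i} p(x_1, x_2, \ldots, x_i) \log p(x_i | x_{i-1}, \ldots, x_1) \tag{2.58} $$

$$ = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1). \tag{2.59} $$

□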
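
As a numerical sanity check of the chain rule (2.48), the following sketch builds an arbitrary joint pmf on three variables and compares the joint entropy with the sum of conditional entropies. The pmf is random and purely illustrative; the helpers `marginal`, `entropy`, and `cond_entropy` are hypothetical names introduced here.

```python
from itertools import product
from math import log2
import random

random.seed(0)

# Arbitrary joint pmf on (x1, x2, x3) with alphabet sizes 2, 3, 2 (illustrative only).
outcomes = list(product(range(2), range(3), range(2)))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(pmf, k):
    """pmf of the first k coordinates (k = 0 gives the trivial pmf {(): 1})."""
    out = {}
    for o, pr in pmf.items():
        out[o[:k]] = out.get(o[:k], 0.0) + pr
    return out

def entropy(pmf):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return -sum(pr * log2(pr) for pr in pmf.values() if pr > 0)

def cond_entropy(pmf, i):
    """H(X_i | X_{i-1}, ..., X_1) from the definition
    -sum p(x_1..x_i) log p(x_i | x_{i-1}, ..., x_1)."""
    joint_i = marginal(pmf, i)
    joint_prev = marginal(pmf, i - 1)
    return -sum(pr * log2(pr / joint_prev[o[:-1]])
                for o, pr in joint_i.items() if pr > 0)

lhs = entropy(p)                                     # H(X1, X2, X3)
rhs = sum(cond_entropy(p, i) for i in range(1, 4))   # sum of conditional entropies
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding for any choice of joint pmf.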


We now define the conditional mutual information as the reduction in 
the uncertainty of X due to knowledge of Y when Z is given. 

Definition The conditional mutual information of random variables X 
and Y given Z is defined by 


$$ I(X; Y | Z) = H(X | Z) - H(X | Y, Z) \tag{2.60} $$

$$ = E_{p(x, y, z)} \log \frac{p(X, Y | Z)}{p(X | Z)\, p(Y | Z)}. \tag{2.61} $$


Mutual information also satisfies a chain rule. 
Theorem 2.5.2 (Chain rule for information)

$$ I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1). \tag{2.62} $$

Proof

$$ I(X_1, X_2, \ldots, X_n; Y) $$

$$ = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n | Y) \tag{2.63} $$

$$ = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1, Y) $$

$$ = \sum_{i=1}^{n} I(X_i; Y | X_1, X_2, \ldots, X_{i-1}). \tag{2.64} $$

□


We define a conditional version of the relative entropy. 

Definition For joint probability mass functions $p(x, y)$ and $q(x, y)$, the
conditional relative entropy $D(p(y|x) \| q(y|x))$ is the average of the
relative entropies between the conditional probability mass functions $p(y|x)$
and $q(y|x)$ averaged over the probability mass function $p(x)$. More
precisely,

$$ D(p(y|x) \| q(y|x)) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)} \tag{2.65} $$

$$ = E_{p(x, y)} \log \frac{p(Y|X)}{q(Y|X)}. \tag{2.66} $$


The notation for conditional relative entropy is not explicit since it omits 
mention of the distribution p(x) of the conditioning random variable. 
However, it is normally understood from the context. 

The relative entropy between two joint distributions on a pair of random
variables can be expanded as the sum of a relative entropy and a
conditional relative entropy. The chain rule for relative entropy is used in
Section 4.4 to prove a version of the second law of thermodynamics.

Theorem 2.5.3 (Chain rule for relative entropy)

$$ D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x)). \tag{2.67} $$

Proof

$$ D(p(x, y) \| q(x, y)) $$

$$ = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{q(x, y)} \tag{2.68} $$

$$ = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x)\, p(y|x)}{q(x)\, q(y|x)} \tag{2.69} $$

$$ = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x)}{q(x)} + \sum_{x} \sum_{y} p(x, y) \log \frac{p(y|x)}{q(y|x)} \tag{2.70} $$

$$ = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x)). \tag{2.71} $$

□
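
The chain rule (2.67) can also be checked on a small case. The sketch below uses two hypothetical 2 x 2 joint pmfs `p` and `q` (chosen only for illustration, with all entries positive) and evaluates the conditional relative entropy directly from (2.65).

```python
from math import log2

# Hypothetical joint pmfs, indexed as p[x][y] and q[x][y]; all entries are positive.
p = [[0.30, 0.20], [0.10, 0.40]]
q = [[0.10, 0.30], [0.40, 0.20]]

def kl(a, b):
    """Relative entropy D(a || b) in bits for pmfs given as lists (b > 0 wherever a > 0)."""
    return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def flatten(m):
    return [v for row in m for v in row]

p_x = [sum(row) for row in p]   # marginal p(x)
q_x = [sum(row) for row in q]   # marginal q(x)

d_joint = kl(flatten(p), flatten(q))   # D(p(x,y) || q(x,y))
d_marg = kl(p_x, q_x)                  # D(p(x) || q(x))

# Conditional relative entropy (2.65): sum_x p(x) D(p(.|x) || q(.|x)).
d_cond = sum(
    p_x[x] * kl([v / p_x[x] for v in p[x]], [v / q_x[x] for v in q[x]])
    for x in range(2)
)

print(d_joint, d_marg + d_cond)
```

Note that the conditional term is averaged with respect to $p(x)$, not $q(x)$, which is exactly what makes the two sides match.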


2.6 JENSEN'S INEQUALITY AND ITS CONSEQUENCES 

In this section we prove some simple properties of the quantities defined 
earlier. We begin with the properties of convex functions. 

Definition A function $f(x)$ is said to be convex over an interval $(a, b)$
if for every $x_1, x_2 \in (a, b)$ and $0 \leq \lambda \leq 1$,

$$ f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2). \tag{2.72} $$

A function $f$ is said to be strictly convex if equality holds only if $\lambda = 0$
or $\lambda = 1$.

Definition A function $f$ is concave if $-f$ is convex. A function is
convex if it always lies below any chord. A function is concave if it
always lies above any chord.

Examples of convex functions include $x^2$, $|x|$, $e^x$, $x \log x$ (for $x \geq 0$),
and so on. Examples of concave functions include $\log x$ and $\sqrt{x}$ for
$x \geq 0$. Figure 2.3 shows some examples of convex and concave functions.
Note that linear functions $ax + b$ are both convex and concave. Convexity
underlies many of the basic properties of information-theoretic quantities
such as entropy and mutual information. Before we prove some of these
properties, we derive some simple results for convex functions.

Theorem 2.6.1 If the function $f$ has a second derivative that is
nonnegative (positive) over an interval, the function is convex (strictly convex)
over that interval.

FIGURE 2.3. Examples of (a) convex and (b) concave functions. 

Proof: We use the Taylor series expansion of the function around $x_0$:

$$ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2} (x - x_0)^2, \tag{2.73} $$

where $x^*$ lies between $x_0$ and $x$. By hypothesis, $f''(x^*) \geq 0$, and thus
the last term is nonnegative for all $x$.

We let $x_0 = \lambda x_1 + (1 - \lambda) x_2$ and take $x = x_1$, to obtain

$$ f(x_1) \geq f(x_0) + f'(x_0)\big((1 - \lambda)(x_1 - x_2)\big). \tag{2.74} $$

Similarly, taking $x = x_2$, we obtain

$$ f(x_2) \geq f(x_0) + f'(x_0)\big(\lambda (x_2 - x_1)\big). \tag{2.75} $$

Multiplying (2.74) by $\lambda$ and (2.75) by $1 - \lambda$ and adding, we obtain (2.72).
The proof for strict convexity proceeds along the same lines. □

Theorem 2.6.1 allows us immediately to verify the strict convexity of
$x^2$, $e^x$, and $x \log x$ for $x \geq 0$, and the strict concavity of $\log x$ and
$\sqrt{x}$ for $x \geq 0$.

Let $E$ denote expectation. Thus, $EX = \sum_{x \in \mathcal{X}} p(x)\, x$ in the discrete
case and $EX = \int x f(x)\, dx$ in the continuous case.
The next inequality is one of the most widely used in mathematics and
one that underlies many of the basic results in information theory.

Theorem 2.6.2 (Jensen's inequality) If $f$ is a convex function and
$X$ is a random variable,

$$ E f(X) \geq f(EX). \tag{2.76} $$

Moreover, if $f$ is strictly convex, the equality in (2.76) implies that
$X = EX$ with probability 1 (i.e., $X$ is a constant).

Proof: We prove this for discrete distributions by induction on the
number of mass points. The proof of conditions for equality when $f$ is strictly
convex is left to the reader.

For a two-mass-point distribution, the inequality becomes

$$ p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2), \tag{2.77} $$

which follows directly from the definition of convex functions. Suppose
that the theorem is true for distributions with $k - 1$ mass points. Then
writing $p_i' = p_i / (1 - p_k)$ for $i = 1, 2, \ldots, k - 1$, we have

$$ \sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \tag{2.78} $$

$$ \geq p_k f(x_k) + (1 - p_k) f\left( \sum_{i=1}^{k-1} p_i' x_i \right) \tag{2.79} $$

$$ \geq f\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right) \tag{2.80} $$

$$ = f\left( \sum_{i=1}^{k} p_i x_i \right), \tag{2.81} $$

where the first inequality follows from the induction hypothesis and the
second follows from the definition of convexity.

The proof can be extended to continuous distributions by continuity 
arguments. □ 
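
To make Jensen's inequality concrete, the sketch below evaluates the gap $E f(X) - f(EX)$ for a few convex functions on a small discrete distribution. The mass points `xs` and probabilities `ps` are hypothetical values chosen for illustration.

```python
from math import exp

# Hypothetical discrete distribution: mass points xs with probabilities ps.
xs = [-1.0, 0.5, 2.0, 3.0]
ps = [0.1, 0.4, 0.3, 0.2]

def expect(g):
    """E g(X) for the discrete distribution above."""
    return sum(p * g(x) for p, x in zip(ps, xs))

mean = expect(lambda x: x)

convex_functions = {
    "x^2": lambda x: x * x,
    "e^x": exp,
    "|x|": abs,
}

# Jensen gap E f(X) - f(EX); it should be nonnegative for every convex f.
gaps = {name: expect(f) - f(mean) for name, f in convex_functions.items()}

# For f(x) = x^2 the gap is exactly the variance of X.
variance = expect(lambda x: (x - mean) ** 2)

for name, gap in gaps.items():
    print(name, gap)
```

For the strictly convex choices the gap is strictly positive here because $X$ is not constant, in line with the equality condition of Theorem 2.6.2.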


We now use these results to prove some of the properties of entropy and 
relative entropy. The following theorem is of fundamental importance. 
Theorem 2.6.3 (Information inequality) Let $p(x)$, $q(x)$, $x \in \mathcal{X}$, be
two probability mass functions. Then

$$ D(p \| q) \geq 0 \tag{2.82} $$

with equality if and only if $p(x) = q(x)$ for all $x$.

Proof: Let $A = \{x : p(x) > 0\}$ be the support set of $p(x)$. Then

$$ -D(p \| q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} \tag{2.83} $$

$$ = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \tag{2.84} $$

$$ \leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} \tag{2.85} $$

$$ = \log \sum_{x \in A} q(x) \tag{2.86} $$

$$ \leq \log \sum_{x \in \mathcal{X}} q(x) \tag{2.87} $$

$$ = \log 1 \tag{2.88} $$

$$ = 0, \tag{2.89} $$

where (2.85) follows from Jensen's inequality. Since $\log t$ is a strictly
concave function of $t$, we have equality in (2.85) if and only if $q(x)/p(x)$
is constant everywhere [i.e., $q(x) = c\, p(x)$ for all $x$]. Thus,
$\sum_{x \in A} q(x) = c \sum_{x \in A} p(x) = c$. We have equality in (2.87) only if
$\sum_{x \in A} q(x) = \sum_{x \in \mathcal{X}} q(x) = 1$, which implies that $c = 1$.
Hence, we have $D(p \| q) = 0$ if and only if $p(x) = q(x)$ for all $x$. □
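
A minimal sketch of the relative entropy computation, including the usual conventions $0 \log (0/q) = 0$ and $p \log (p/0) = \infty$, illustrates the information inequality. The function name `kl_divergence` and the example pmfs are assumptions made for this illustration.

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p || q) in bits for pmfs given as equal-length lists.

    Uses the conventions 0 log(0/q) = 0 and p log(p/0) = infinity.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # terms outside the support of p contribute nothing
        if qi == 0:
            return inf        # p puts mass where q has none
        total += pi * log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.75, 0.25]

d_pq = kl_divergence(p, q)   # strictly positive because p != q
d_qp = kl_divergence(q, p)   # generally differs from d_pq: D is not symmetric
d_pp = kl_divergence(p, p)   # zero, the equality case of the theorem

print(d_pq, d_qp, d_pp)
```

The asymmetry visible in the output is a reminder that relative entropy is not a true distance, even though it is nonnegative and vanishes only when the two distributions coincide.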

Corollary (Nonnegativity of mutual information) For any two random
variables, $X$, $Y$,

$$ I(X; Y) \geq 0, \tag{2.90} $$

with equality if and only if $X$ and $Y$ are independent.

Proof: $I(X; Y) = D(p(x, y) \| p(x) p(y)) \geq 0$, with equality if and only
if $p(x, y) = p(x) p(y)$ (i.e., $X$ and $Y$ are independent). □

Corollary

$$ D(p(y|x) \| q(y|x)) \geq 0, \tag{2.91} $$

with equality if and only if $p(y|x) = q(y|x)$ for all $y$ and $x$ such that
$p(x) > 0$.

Corollary

$$ I(X; Y | Z) \geq 0, \tag{2.92} $$

with equality if and only if $X$ and $Y$ are conditionally independent given $Z$.

We now show that the uniform distribution over the range $\mathcal{X}$ is the
maximum entropy distribution over this range. It follows that any random
variable with this range has an entropy no greater than $\log |\mathcal{X}|$.

Theorem 2.6.4 $H(X) \leq \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of
elements in the range of $X$, with equality if and only if $X$ has a uniform
distribution over $\mathcal{X}$.

Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$,
and let $p(x)$ be the probability mass function for $X$. Then

$$ D(p \| u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X). \tag{2.93} $$

Hence by the nonnegativity of relative entropy,

$$ 0 \leq D(p \| u) = \log |\mathcal{X}| - H(X). \tag{2.94} $$

□

Theorem 2.6.5 (Conditioning reduces entropy) (Information can't hurt)

$$ H(X | Y) \leq H(X) \tag{2.95} $$

with equality if and only if $X$ and $Y$ are independent.

Proof: $0 \leq I(X; Y) = H(X) - H(X | Y)$. □

Intuitively, the theorem says that knowing another random variable $Y$
can only reduce the uncertainty in $X$. Note that this is true only on the
average. Specifically, $H(X | Y = y)$ may be greater than or less than or
equal to $H(X)$, but on the average $H(X | Y) = \sum_{y} p(y) H(X | Y = y) \leq H(X)$.
For example, in a court case, specific new evidence might increase
uncertainty, but on the average evidence decreases uncertainty.
Example 2.6.1 Let $(X, Y)$ have the following joint distribution:

            X = 1    X = 2
   Y = 1      0       3/4
   Y = 2     1/8      1/8

Then $H(X) = H(\frac{1}{8}, \frac{7}{8}) = 0.544$ bit, $H(X | Y = 1) = 0$ bits, and
$H(X | Y = 2) = 1$ bit. We calculate $H(X | Y) = \frac{3}{4} H(X | Y = 1) + \frac{1}{4} H(X | Y = 2) = 0.25$ bit.
Thus, the uncertainty in $X$ is increased if $Y = 2$
is observed and decreased if $Y = 1$ is observed, but uncertainty decreases
on the average.
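
The numbers in this example can be reproduced with a few lines of code. The sketch assumes the joint pmf $p(1,1) = 0$, $p(2,1) = 3/4$, $p(1,2) = 1/8$, $p(2,2) = 1/8$ (keys are `(x, y)`), which is the assignment consistent with the values quoted in the example; the variable names are illustrative.

```python
from math import log2

# Assumed joint pmf for the example, keyed by (x, y).
joint = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

def entropy(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = {x: sum(pr for (a, _), pr in joint.items() if a == x) for x in (1, 2)}
p_y = {y: sum(pr for (_, b), pr in joint.items() if b == y) for y in (1, 2)}

h_x = entropy(p_x.values())

# H(X | Y = y) for each y, from the conditional pmf p(x | y) = p(x, y) / p(y).
h_x_given = {y: entropy([joint[(x, y)] / p_y[y] for x in (1, 2)]) for y in (1, 2)}

# H(X | Y) is the p(y)-weighted average of the values above.
h_x_given_y = sum(p_y[y] * h_x_given[y] for y in (1, 2))

print(h_x, h_x_given, h_x_given_y)
```

Observing $Y = 2$ raises the uncertainty about $X$ above $H(X)$, while the average $H(X | Y)$ stays below it, as Theorem 2.6.5 requires.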


Theorem 2.6.6 (Independence bound on entropy) Let
$X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then

$$ H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i) \tag{2.96} $$

with equality if and only if the $X_i$ are independent.

Proof: By the chain rule for entropies,

$$ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) \tag{2.97} $$

$$ \leq \sum_{i=1}^{n} H(X_i), \tag{2.98} $$

where the inequality follows directly from Theorem 2.6.5. We have
equality if and only if $X_i$ is independent of $X_{i-1}, \ldots, X_1$ for all $i$ (i.e., if and
only if the $X_i$'s are independent). □


2.7 LOG SUM INEQUALITY AND ITS APPLICATIONS 

We now prove a simple consequence of the concavity of the logarithm, 
which will be used to prove some concavity results for the entropy. 
Theorem 2.7.1 (Log sum inequality) For nonnegative numbers
$a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,

$$ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \tag{2.99} $$

with equality if and only if $\frac{a_i}{b_i} = \text{const}$.

We again use the convention that $0 \log 0 = 0$, $a \log \frac{a}{0} = \infty$ if $a > 0$, and
$0 \log \frac{0}{0} = 0$. These follow easily from continuity.

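Proof: Assume without loss of generality that $a_i > 0$ and $b_i > 0$. The
function $f(t) = t \log t$ is strictly convex, since $f''(t) = \frac{1}{t} \log e > 0$ for all
positive $t$. Hence by Jensen's inequality, we have

$$ \sum_{i} \alpha_i f(t_i) \geq f\left( \sum_{i} \alpha_i t_i \right) \tag{2.100} $$

for $\alpha_i \geq 0$, $\sum_{i} \alpha_i = 1$. Setting $\alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j}$ and $t_i = \frac{a_i}{b_i}$, we obtain

$$ \sum_{i} \frac{a_i}{\sum_{j} b_j} \log \frac{a_i}{b_i} \geq \sum_{i} \frac{a_i}{\sum_{j} b_j} \log \sum_{i} \frac{a_i}{\sum_{j} b_j}, \tag{2.101} $$

which is the log sum inequality. □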


We now use the log sum inequality to prove various convexity results.
We begin by reproving Theorem 2.6.3, which states that $D(p \| q) \geq 0$ with
equality if and only if $p(x) = q(x)$. By the log sum inequality,

$$ D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \tag{2.102} $$

$$ \geq \left( \sum_{x} p(x) \right) \log \frac{\sum_{x} p(x)}{\sum_{x} q(x)} \tag{2.103} $$

$$ = 1 \log \frac{1}{1} = 0 \tag{2.104} $$

with equality if and only if $\frac{p(x)}{q(x)} = c$. Since both $p$ and $q$ are probability
mass functions, $c = 1$, and hence we have $D(p \| q) = 0$ if and only if
$p(x) = q(x)$ for all $x$.
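
The log sum inequality (2.99) and its equality condition can be illustrated numerically. The sketch below uses hypothetical positive vectors `a` and `b`; `log_sum_gap` is an illustrative helper that returns the left-hand side minus the right-hand side.

```python
from math import log2

def log_sum_gap(a, b):
    """Left-hand side minus right-hand side of the log sum inequality,
    for strictly positive sequences a and b of equal length."""
    lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * log2(sum(a) / sum(b))
    return lhs - rhs

# Generic case: the ratios a_i / b_i are not all equal, so the gap is strictly positive.
gap_generic = log_sum_gap([1.0, 2.0, 3.0], [2.0, 1.0, 4.0])

# Equality case: a_i / b_i = 1/2 for every i, so the gap vanishes.
gap_proportional = log_sum_gap([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(gap_generic, gap_proportional)
```

When `a` and `b` are both pmfs, `lhs` is the relative entropy and `rhs` is zero, which is the reproof of Theorem 2.6.3 given above.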
Theorem 2.7.2 (Convexity of relative entropy) $D(p \| q)$ is convex in
the pair $(p, q)$; that is, if $(p_1, q_1)$ and $(p_2, q_2)$ are two pairs of probability
mass functions, then

$$ D(\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D(p_1 \| q_1) + (1 - \lambda) D(p_2 \| q_2) \tag{2.105} $$

for all $0 \leq \lambda \leq 1$.

Proof: We apply the log sum inequality to a term on the left-hand side
of (2.105):

$$ (\lambda p_1(x) + (1 - \lambda) p_2(x)) \log \frac{\lambda p_1(x) + (1 - \lambda) p_2(x)}{\lambda q_1(x) + (1 - \lambda) q_2(x)} $$

$$ \leq \lambda p_1(x) \log \frac{\lambda p_1(x)}{\lambda q_1(x)} + (1 - \lambda) p_2(x) \log \frac{(1 - \lambda) p_2(x)}{(1 - \lambda) q_2(x)}. \tag{2.106} $$

Summing this over all $x$, we obtain the desired property. □


Theorem 2.7.3 (Concavity of entropy) $H(p)$ is a concave function
of $p$.

Proof

$$ H(p) = \log |\mathcal{X}| - D(p \| u), \tag{2.107} $$

where $u$ is the uniform distribution on $|\mathcal{X}|$ outcomes. The concavity of $H$
then follows directly from the convexity of $D$. □


Alternative Proof: Let $X_1$ be a random variable with distribution $p_1$,
taking on values in a set $A$. Let $X_2$ be another random variable with
distribution $p_2$ on the same set. Let

$$ \theta = \begin{cases} 1 & \text{with probability } \lambda, \\ 2 & \text{with probability } 1 - \lambda. \end{cases} \tag{2.108} $$

Let $Z = X_\theta$. Then the distribution of $Z$ is $\lambda p_1 + (1 - \lambda) p_2$. Now since
conditioning reduces entropy, we have

$$ H(Z) \geq H(Z | \theta), \tag{2.109} $$

or equivalently,

$$ H(\lambda p_1 + (1 - \lambda) p_2) \geq \lambda H(p_1) + (1 - \lambda) H(p_2), \tag{2.110} $$

which proves the concavity of the entropy as a function of the distribution. □
One of the consequences of the concavity of entropy is that mixing two
gases of equal entropy results in a gas with higher entropy.
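
The mixing statement can be illustrated with two hypothetical pmfs of equal entropy. The sketch below checks the concavity inequality (2.110) on a grid of mixing weights; `p1`, `p2`, and the helper names are assumptions made for this illustration.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a pmf given as a list, with 0 log 0 = 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def mix(p1, p2, lam):
    """The mixture lam * p1 + (1 - lam) * p2."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

# Two hypothetical pmfs on the same 3-letter alphabet with equal entropy.
p1 = [0.9, 0.1, 0.0]
p2 = [0.0, 0.1, 0.9]

# Concavity: H(mixture) minus the mixture of entropies, for lam = 0, 0.1, ..., 1.
gaps = []
for step in range(11):
    lam = step / 10
    h_mix = entropy(mix(p1, p2, lam))
    h_avg = lam * entropy(p1) + (1 - lam) * entropy(p2)
    gaps.append(h_mix - h_avg)

print(entropy(p1), entropy(p2), max(gaps))
```

The gap is zero at the endpoints and largest near the even mixture, where the mixed distribution has noticeably higher entropy than either component.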

Theorem 2.7.4 Let $(X, Y) \sim p(x, y) = p(x) p(y|x)$. The mutual
information $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$ and a convex
function of $p(y|x)$ for fixed $p(x)$.

Proof: To prove the first part, we expand the mutual information

$$ I(X; Y) = H(Y) - H(Y | X) = H(Y) - \sum_{x} p(x) H(Y | X = x). \tag{2.111} $$

If $p(y|x)$ is fixed, then $p(y)$ is a linear function of $p(x)$. Hence $H(Y)$,
which is a concave function of $p(y)$, is a concave function of $p(x)$. The
second term is a linear function of $p(x)$. Hence, the difference is a concave
function of $p(x)$.

To prove the second part, we fix $p(x)$ and consider two different
conditional distributions $p_1(y|x)$ and $p_2(y|x)$. The corresponding joint
distributions are $p_1(x, y) = p(x) p_1(y|x)$ and $p_2(x, y) = p(x) p_2(y|x)$, and
their respective marginals are $p(x)$, $p_1(y)$ and $p(x)$, $p_2(y)$. Consider a
conditional distribution

$$ p_\lambda(y|x) = \lambda p_1(y|x) + (1 - \lambda) p_2(y|x), \tag{2.112} $$

which is a mixture of $p_1(y|x)$ and $p_2(y|x)$ where $0 \leq \lambda \leq 1$. The
corresponding joint distribution is also a mixture of the corresponding joint
distributions,

$$ p_\lambda(x, y) = \lambda p_1(x, y) + (1 - \lambda) p_2(x, y), \tag{2.113} $$

and the distribution of $Y$ is also a mixture,

$$ p_\lambda(y) = \lambda p_1(y) + (1 - \lambda) p_2(y). \tag{2.114} $$

Hence if we let $q_\lambda(x, y) = p(x) p_\lambda(y)$ be the product of the marginal
distributions, we have

$$ q_\lambda(x, y) = \lambda q_1(x, y) + (1 - \lambda) q_2(x, y). \tag{2.115} $$

Since the mutual information is the relative entropy between the joint
distribution and the product of the marginals,

$$ I(X; Y) = D(p_\lambda(x, y) \| q_\lambda(x, y)), \tag{2.116} $$

and relative entropy $D(p \| q)$ is a convex function of $(p, q)$, it follows that
the mutual information is a convex function of the conditional distribution. □
2.8 DATA PROCESSING INEQUALITY

The data-processing inequality can be used to show that no clever
manipulation of the data can improve the inferences that can be made from
the data.

Definition Random variables $X$, $Y$, $Z$ are said to form a Markov chain
in that order (denoted by $X \to Y \to Z$) if the conditional distribution of
$Z$ depends only on $Y$ and is conditionally independent of $X$. Specifically,
$X$, $Y$, and $Z$ form a Markov chain $X \to Y \to Z$ if the joint probability
mass function can be written as

$$ p(x, y, z) = p(x) p(y|x) p(z|y). \tag{2.117} $$


Some simple consequences are as follows:

• $X \to Y \to Z$ if and only if $X$ and $Z$ are conditionally independent
given $Y$. Markovity implies conditional independence because

$$ p(x, z | y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z|y)}{p(y)} = p(x|y) p(z|y). \tag{2.118} $$

This is the characterization of Markov chains that can be extended
to define Markov fields, which are n-dimensional random processes
in which the interior and exterior are independent given the values
on the boundary.

• $X \to Y \to Z$ implies that $Z \to Y \to X$. Thus, the condition is
sometimes written $X \leftrightarrow Y \leftrightarrow Z$.

• If $Z = f(Y)$, then $X \to Y \to Z$.


We can now prove an important and useful theorem demonstrating that 
no processing of Y, deterministic or random, can increase the information 
that Y contains about X. 


Theorem 2.8.1 (Data-processing inequality) If $X \to Y \to Z$, then
$I(X; Y) \geq I(X; Z)$.

Proof: By the chain rule, we can expand mutual information in two
different ways:

$$ I(X; Y, Z) = I(X; Z) + I(X; Y | Z) \tag{2.119} $$

$$ = I(X; Y) + I(X; Z | Y). \tag{2.120} $$

Since $X$ and $Z$ are conditionally independent given $Y$, we have
$I(X; Z | Y) = 0$. Since $I(X; Y | Z) \geq 0$, we have

$$ I(X; Y) \geq I(X; Z). \tag{2.121} $$

We have equality if and only if $I(X; Y | Z) = 0$ (i.e., $X \to Z \to Y$ forms
a Markov chain). Similarly, one can prove that $I(Y; Z) \geq I(X; Z)$. □
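
A small numerical illustration of the data-processing inequality: the sketch below builds a Markov chain $X \to Y \to Z$ from a hypothetical input pmf and two binary symmetric channels (the crossover probabilities 0.1 and 0.2 are arbitrary choices), then compares the three pairwise mutual informations.

```python
from math import log2

def mutual_information(joint):
    """I(A; B) in bits from a joint pmf given as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(v * log2(v / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, v in enumerate(row) if v > 0)

def bsc(eps):
    """Transition matrix w[input][output] of a binary symmetric channel."""
    return [[1 - eps, eps], [eps, 1 - eps]]

p_x = [0.3, 0.7]   # hypothetical input distribution
w_yx = bsc(0.1)    # p(y | x)
w_zy = bsc(0.2)    # p(z | y)

# Pairwise joints implied by p(x, y, z) = p(x) p(y|x) p(z|y).
p_xy = [[p_x[x] * w_yx[x][y] for y in range(2)] for x in range(2)]
p_y = [sum(p_xy[x][y] for x in range(2)) for y in range(2)]
p_xz = [[sum(p_xy[x][y] * w_zy[y][z] for y in range(2)) for z in range(2)]
        for x in range(2)]
p_yz = [[p_y[y] * w_zy[y][z] for z in range(2)] for y in range(2)]

i_xy = mutual_information(p_xy)
i_xz = mutual_information(p_xz)
i_yz = mutual_information(p_yz)

print(i_xy, i_yz, i_xz)
```

Both $I(X; Y)$ and $I(Y; Z)$ dominate $I(X; Z)$: the second channel can only discard information that $Y$ carries about $X$.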

Corollary In particular, if $Z = g(Y)$, we have $I(X; Y) \geq I(X; g(Y))$.

Proof: $X \to Y \to g(Y)$ forms a Markov chain. □

Thus functions of the data $Y$ cannot increase the information about $X$.

Corollary If $X \to Y \to Z$, then $I(X; Y | Z) \leq I(X; Y)$.

Proof: We note in (2.119) and (2.120) that $I(X; Z | Y) = 0$, by
Markovity, and $I(X; Z) \geq 0$. Thus,

$$ I(X; Y | Z) \leq I(X; Y). \tag{2.122} $$

□

Thus, the dependence of $X$ and $Y$ is decreased (or remains unchanged)
by the observation of a "downstream" random variable $Z$. Note that it is
also possible that $I(X; Y | Z) > I(X; Y)$ when $X$, $Y$, and $Z$ do not form a
Markov chain. For example, let $X$ and $Y$ be independent fair binary random
variables, and let $Z = X + Y$. Then $I(X; Y) = 0$, but $I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = H(X | Z) = P(Z = 1) H(X | Z = 1) = \frac{1}{2}$ bit.
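
The counterexample with $Z = X + Y$ is small enough to enumerate. The sketch below computes $I(X; Y)$ and $I(X; Y | Z)$ from joint entropies, using the identity $I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)$; the helper names are illustrative.

```python
from itertools import product
from math import log2

# Joint pmf of (X, Y, Z) for independent fair bits X, Y and Z = X + Y.
joint = {(x, y, x + y): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for o, pr in pmf.items():
        key = tuple(o[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def entropy(pmf):
    return -sum(pr * log2(pr) for pr in pmf.values() if pr > 0)

def H(*idx):
    """Joint entropy of the chosen coordinates (0 = X, 1 = Y, 2 = Z)."""
    return entropy(marginal(joint, idx))

i_xy = H(0) + H(1) - H(0, 1)
i_xy_given_z = H(0, 2) + H(1, 2) - H(0, 1, 2) - H(2)

print(i_xy, i_xy_given_z)
```

Conditioning on $Z$ creates dependence here: when $Z = 1$, knowing $Y$ determines $X$, even though $X$ and $Y$ are unconditionally independent.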

2.9 SUFFICIENT STATISTICS 

This section is a sidelight showing the power of the data-processing
inequality in clarifying an important idea in statistics. Suppose that we
have a family of probability mass functions $\{f_\theta(x)\}$ indexed by $\theta$, and let
$X$ be a sample from a distribution in this family. Let $T(X)$ be any statistic
(function of the sample) like the sample mean or sample variance. Then
$\theta \to X \to T(X)$, and by the data-processing inequality, we have

$$ I(\theta; T(X)) \leq I(\theta; X) \tag{2.123} $$

for any distribution on $\theta$. However, if equality holds, no information
is lost.

A statistic $T(X)$ is called sufficient for $\theta$ if it contains all the
information in $X$ about $\theta$.
Definition A function $T(X)$ is said to be a sufficient statistic relative to
the family $\{f_\theta(x)\}$ if $X$ is independent of $\theta$ given $T(X)$ for any distribution
on $\theta$ [i.e., $\theta \to T(X) \to X$ forms a Markov chain].

This is the same as the condition for equality in the data-processing
inequality,

$$ I(\theta; X) = I(\theta; T(X)) \tag{2.124} $$

for all distributions on $\theta$. Hence sufficient statistics preserve mutual
information and conversely.

Here are some examples of sufficient statistics: 


1. Let $X_1, X_2, \ldots, X_n$, $X_i \in \{0, 1\}$, be an independent and identically
distributed (i.i.d.) sequence of coin tosses of a coin with unknown
parameter $\theta = \Pr(X_i = 1)$. Given $n$, the number of 1's is a sufficient
statistic for $\theta$. Here $T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i$. In fact, we can
show that given $T$, all sequences having that many 1's are equally
likely and independent of the parameter $\theta$. Specifically,

$$ \Pr\left\{ (X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \,\Big|\, \sum_{i=1}^{n} X_i = k \right\} = \begin{cases} \frac{1}{\binom{n}{k}} & \text{if } \sum x_i = k, \\ 0 & \text{otherwise.} \end{cases} \tag{2.125} $$

Thus, $\theta \to \sum X_i \to (X_1, X_2, \ldots, X_n)$ forms a Markov chain, and
$T$ is a sufficient statistic for $\theta$.

The next two examples involve probability densities instead of
probability mass functions, but the theory still applies. We define
entropy and mutual information for continuous random variables in
Chapter 8.

2. If $X$ is normally distributed with mean $\theta$ and variance 1; that is, if

$$ f_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-(x - \theta)^2 / 2} = \mathcal{N}(\theta, 1), \tag{2.126} $$

and $X_1, X_2, \ldots, X_n$ are drawn independently according to this
distribution, a sufficient statistic for $\theta$ is the sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.
It can be verified that the conditional distribution of $X_1, X_2, \ldots, X_n$,
conditioned on $\bar{X}_n$ and $n$, does not depend on $\theta$.

3. If $f_\theta = \text{Uniform}(\theta, \theta + 1)$, a sufficient statistic for $\theta$ is

$$ T(X_1, X_2, \ldots, X_n) = (\max\{X_1, X_2, \ldots, X_n\}, \min\{X_1, X_2, \ldots, X_n\}). \tag{2.127} $$

The proof of this is slightly more complicated, but again one can
show that the distribution of the data is independent of the parameter
given the statistic $T$.
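
The claim in the coin-tossing example, that the conditional distribution of the sequence given the number of 1's does not depend on $\theta$, can be checked by brute-force enumeration. The sketch below is illustrative; the function name `conditional_given_sum` and the parameter values are assumptions.

```python
from itertools import product
from math import comb

def conditional_given_sum(theta, n, k):
    """Pr{(X_1..X_n) = x | sum X_i = k} for i.i.d. Bernoulli(theta) tosses,
    computed by enumerating all 2^n binary sequences."""
    prob = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
            for x in product((0, 1), repeat=n)}
    p_k = sum(pr for x, pr in prob.items() if sum(x) == k)
    return {x: pr / p_k for x, pr in prob.items() if sum(x) == k}

n, k = 5, 2
cond_a = conditional_given_sum(0.2, n, k)   # one value of theta
cond_b = conditional_given_sum(0.7, n, k)   # a very different value of theta
uniform_value = 1 / comb(n, k)              # the value predicted by (2.125)

print(len(cond_a), uniform_value, set(round(v, 12) for v in cond_a.values()))
```

Both conditional distributions are uniform over the $\binom{n}{k}$ sequences with $k$ ones, so once the count is known the sequence carries no further information about $\theta$.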

The minimal sufficient statistic is a sufficient statistic that is a function
of all other sufficient statistics.

Definition A statistic $T(X)$ is a minimal sufficient statistic relative to
$\{f_\theta(x)\}$ if it is a function of every other sufficient statistic $U$. Interpreting
this in terms of the data-processing inequality, this implies that

$$ \theta \to T(X) \to U(X) \to X. \tag{2.128} $$

Hence, a minimal sufficient statistic maximally compresses the
information about $\theta$ in the sample. Other sufficient statistics may contain
additional irrelevant information. For example, for a normal distribution
with mean $\theta$, the pair of functions giving the mean of all odd samples and
the mean of all even samples is a sufficient statistic, but not a minimal
sufficient statistic. In the preceding examples, the sufficient statistics are
also minimal.


2.10 FANO'S INEQUALITY

Suppose that we know a random variable $Y$ and we wish to guess the value
of a correlated random variable $X$. Fano's inequality relates the
probability of error in guessing the random variable $X$ to its conditional entropy
$H(X|Y)$. It will be crucial in proving the converse to Shannon's channel
capacity theorem in Chapter 7. From Problem 2.5 we know that the
conditional entropy of a random variable $X$ given another random variable
$Y$ is zero if and only if $X$ is a function of $Y$. Hence we can estimate $X$
from $Y$ with zero probability of error if and only if $H(X|Y) = 0$.

Extending this argument, we expect to be able to estimate $X$ with a
low probability of error only if the conditional entropy $H(X|Y)$ is small.
Fano's inequality quantifies this idea. Suppose that we wish to estimate a
random variable $X$ with a distribution $p(x)$. We observe a random variable
$Y$ that is related to $X$ by the conditional distribution $p(y|x)$. From $Y$, we
calculate a function $g(Y) = \hat{X}$, where $\hat{X}$ is an estimate of $X$ and takes on
values in $\hat{\mathcal{X}}$. We will not restrict the alphabet $\hat{\mathcal{X}}$ to be equal to $\mathcal{X}$, and we
will also allow the function $g(Y)$ to be random. We wish to bound the
probability that $\hat{X} \neq X$. We observe that $X \to Y \to \hat{X}$ forms a Markov
chain. Define the probability of error

$$ P_e = \Pr\{\hat{X} \neq X\}. \tag{2.129} $$

Theorem 2.10.1 (Fano's Inequality) For any estimator $\hat{X}$ such that
$X \to Y \to \hat{X}$, with $P_e = \Pr(X \neq \hat{X})$, we have

$$ H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}) \geq H(X | Y). \tag{2.130} $$

This inequality can be weakened to

$$ 1 + P_e \log |\mathcal{X}| \geq H(X | Y) \tag{2.131} $$

or

$$ P_e \geq \frac{H(X | Y) - 1}{\log |\mathcal{X}|}. \tag{2.132} $$

Remark Note from (2.130) that $P_e = 0$ implies that $H(X | Y) = 0$, as
intuition suggests.

Proof: We first ignore the role of $Y$ and prove the first inequality in
(2.130). We will then use the data-processing inequality to prove the more
traditional form of Fano's inequality, given by the second inequality in
(2.130). Define an error random variable,

$$ E = \begin{cases} 1 & \text{if } \hat{X} \neq X, \\ 0 & \text{if } \hat{X} = X. \end{cases} \tag{2.133} $$

Then, using the chain rule for entropies to expand $H(E, X | \hat{X})$ in two
different ways, we have

$$ H(E, X | \hat{X}) = H(X | \hat{X}) + \underbrace{H(E | X, \hat{X})}_{= 0} \tag{2.134} $$

$$ = \underbrace{H(E | \hat{X})}_{\leq H(P_e)} + \underbrace{H(X | E, \hat{X})}_{\leq P_e \log |\mathcal{X}|}. \tag{2.135} $$

Since conditioning reduces entropy, $H(E | \hat{X}) \leq H(E) = H(P_e)$. Now
since $E$ is a function of $X$ and $\hat{X}$, the conditional entropy $H(E | X, \hat{X})$ is
equal to 0. Also, since $E$ is a binary-valued random variable, $H(E) = H(P_e)$.
The remaining term, $H(X | E, \hat{X})$, can be bounded as follows:

$$ H(X | E, \hat{X}) = \Pr(E = 0) H(X | \hat{X}, E = 0) + \Pr(E = 1) H(X | \hat{X}, E = 1) \leq (1 - P_e) \cdot 0 + P_e \log |\mathcal{X}|, \tag{2.136} $$

since given $E = 0$, $X = \hat{X}$, and given $E = 1$, we can upper bound the
conditional entropy by the log of the number of possible outcomes.
Combining these results, we obtain

$$ H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}). \tag{2.137} $$

By the data-processing inequality, we have $I(X; \hat{X}) \leq I(X; Y)$ since
$X \to Y \to \hat{X}$ is a Markov chain, and therefore $H(X | \hat{X}) \geq H(X | Y)$. Thus,
we have

$$ H(P_e) + P_e \log |\mathcal{X}| \geq H(X | \hat{X}) \geq H(X | Y). \tag{2.138} $$

□
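
Fano's inequality can be checked on a small example. The sketch below uses a hypothetical joint pmf on a 3-letter alphabet and, as one possible estimator, the rule $g(y) = \arg\max_x p(x, y)$; the inequality holds for any estimator, so this choice is only for illustration.

```python
from math import log2

def binary_entropy(p):
    """H(p) in bits for a binary random variable with Pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Hypothetical joint pmf p(x, y) on a 3-letter alphabet, indexed as joint[x][y].
joint = [[0.30, 0.03, 0.02],
         [0.04, 0.25, 0.03],
         [0.02, 0.05, 0.26]]
m = len(joint)
p_y = [sum(joint[x][y] for x in range(m)) for y in range(m)]

# H(X | Y) = -sum p(x, y) log p(x | y).
h_x_given_y = -sum(joint[x][y] * log2(joint[x][y] / p_y[y])
                   for x in range(m) for y in range(m) if joint[x][y] > 0)

# Illustrative estimator: guess the x that maximizes p(x, y) for the observed y.
g = [max(range(m), key=lambda x: joint[x][y]) for y in range(m)]
p_e = 1.0 - sum(joint[g[y]][y] for y in range(m))

fano_upper = binary_entropy(p_e) + p_e * log2(m)   # left-hand side of (2.130)
weak_lower_pe = (h_x_given_y - 1) / log2(m)        # right-hand side of (2.132)

print(p_e, h_x_given_y, fano_upper, weak_lower_pe)
```

In this example the weakened bound (2.132) is negative and therefore uninformative, while the full form (2.130) still gives a nontrivial constraint linking $P_e$ and $H(X|Y)$.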

Corollary For any two random variables $X$ and $Y$, let $p = \Pr(X \neq Y)$.

$$ H(p) + p \log |\mathcal{X}| \geq H(X | Y). \tag{2.139} $$

Proof: Let $\hat{X} = Y$ in Fano's inequality. □

For any two random variables $X$ and $Y$, if the estimator $g(Y)$ takes
values in the set $\mathcal{X}$, we can strengthen the inequality slightly by replacing
$\log |\mathcal{X}|$ with $\log(|\mathcal{X}| - 1)$.

Corollary Let $P_e = \Pr(X \neq \hat{X})$, and let $\hat{X} : \mathcal{Y} \to \mathcal{X}$; then

$$ H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X | Y). \tag{2.140} $$

Proof: The proof of the theorem goes through without change, except
that

$$ H(X | E, \hat{X}) = \Pr(E = 0) H(X | \hat{X}, E = 0) + \Pr(E = 1) H(X | \hat{X}, E = 1) \tag{2.141} $$

$$ \leq (1 - P_e) \cdot 0 + P_e \log(|\mathcal{X}| - 1), \tag{2.142} $$

since given $E = 0$, $X = \hat{X}$, and given $E = 1$, the range of possible $X$
outcomes is $|\mathcal{X}| - 1$, so we can upper bound the conditional entropy by
$\log(|\mathcal{X}| - 1)$, the logarithm of the number of possible outcomes.
Substituting this provides us with the stronger inequality. □
Remark Suppose that there is no knowledge of $Y$. Thus, $X$ must be
guessed without any information. Let $X \in \{1, 2, \ldots, m\}$ and $p_1 \geq p_2 \geq \cdots \geq p_m$.
Then the best guess of $X$ is $\hat{X} = 1$ and the resulting probability
of error is $P_e = 1 - p_1$. Fano's inequality becomes

$$ H(P_e) + P_e \log(m - 1) \geq H(X). \tag{2.143} $$

The probability mass function

$$ (p_1, p_2, \ldots, p_m) = \left( 1 - P_e, \frac{P_e}{m - 1}, \ldots, \frac{P_e}{m - 1} \right) \tag{2.144} $$

achieves this bound with equality. Thus, Fano's inequality is sharp.

While we are at it, let us introduce a new inequality relating probability
of error and entropy. Let $X$ and $X'$ be two independent identically
distributed random variables with entropy $H(X)$. The probability that $X = X'$
is given by

$$ \Pr(X = X') = \sum_{x} p^2(x). \tag{2.145} $$

We have the following inequality: 


Lemma 2.10.1 If $X$ and $X'$ are i.i.d. with entropy $H(X)$,

$$ \Pr(X = X') \geq 2^{-H(X)}, \tag{2.146} $$

with equality if and only if $X$ has a uniform distribution.

Proof: Suppose that $X \sim p(x)$. By Jensen's inequality, we have

$$ 2^{E \log p(X)} \leq E\, 2^{\log p(X)}, \tag{2.147} $$

which implies that

$$ 2^{-H(X)} = 2^{\sum p(x) \log p(x)} \leq \sum p(x)\, 2^{\log p(x)} = \sum p^2(x). \tag{2.148} $$

□
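
The lemma is easy to check numerically. The sketch below compares the collision probability $\sum_x p^2(x)$ with $2^{-H(X)}$ for a skewed pmf and for a uniform pmf; both pmfs are illustrative choices.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a pmf given as a list."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def collision_probability(p):
    """Pr(X = X') for X, X' drawn i.i.d. from p."""
    return sum(pi * pi for pi in p)

skewed = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4

results = {}
for name, p in (("skewed", skewed), ("uniform", uniform)):
    results[name] = (collision_probability(p), 2 ** (-entropy(p)))
    print(name, results[name])
```

For the uniform pmf the two quantities coincide, which is the equality case of the lemma; for the skewed pmf the collision probability is strictly larger.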

Corollary Let $X$, $X'$ be independent with $X \sim p(x)$, $X' \sim r(x)$, $x, x' \in \mathcal{X}$.
Then

$$ \Pr(X = X') \geq 2^{-H(p) - D(p \| r)}, \tag{2.149} $$

$$ \Pr(X = X') \geq 2^{-H(r) - D(r \| p)}. \tag{2.150} $$

Proof: We have

$$ 2^{-H(p) - D(p \| r)} = 2^{\sum p(x) \log p(x) + \sum p(x) \log \frac{r(x)}{p(x)}} \tag{2.151} $$

$$ = 2^{\sum p(x) \log r(x)} \tag{2.152} $$

$$ \leq \sum p(x)\, 2^{\log r(x)} \tag{2.153} $$

$$ = \sum p(x) r(x) \tag{2.154} $$

$$ = \Pr(X = X'), \tag{2.155} $$

where the inequality follows from Jensen's inequality and the convexity
of the function $f(y) = 2^y$. □

The following telegraphic summary omits qualifying conditions. 


SUMMARY 

Definition The entropy $H(X)$ of a discrete random variable $X$ is
defined by

$$ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x). \tag{2.156} $$

Properties of H

1. $H(X) \geq 0$.

2. $H_b(X) = (\log_b a) H_a(X)$.

3. (Conditioning reduces entropy) For any two random variables, $X$
and $Y$, we have

$$ H(X | Y) \leq H(X) \tag{2.157} $$

with equality if and only if $X$ and $Y$ are independent.

4. $H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i)$, with equality if and only if the
$X_i$ are independent.

5. $H(X) \leq \log |\mathcal{X}|$, with equality if and only if $X$ is distributed
uniformly over $\mathcal{X}$.

6. $H(p)$ is concave in $p$.
Definition The relative entropy $D(p \| q)$ of the probability mass
function $p$ with respect to the probability mass function $q$ is defined by

$$ D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}. \tag{2.158} $$

Definition The mutual information between two random variables $X$
and $Y$ is defined as

$$ I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}. \tag{2.159} $$

Alternative expressions

$$ H(X) = E_p \log \frac{1}{p(X)}, \tag{2.160} $$

$$ H(X, Y) = E_p \log \frac{1}{p(X, Y)}, \tag{2.161} $$

$$ H(X | Y) = E_p \log \frac{1}{p(X | Y)}, \tag{2.162} $$

$$ I(X; Y) = E_p \log \frac{p(X, Y)}{p(X) p(Y)}, \tag{2.163} $$

$$ D(p \| q) = E_p \log \frac{p(X)}{q(X)}. \tag{2.164} $$

Properties of D and I 


1. $I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)$.

2. $D(p \| q) \geq 0$ with equality if and only if $p(x) = q(x)$, for all $x \in \mathcal{X}$.

3. $I(X; Y) = D(p(x, y) \| p(x) p(y)) \geq 0$, with equality if and only if
$p(x, y) = p(x) p(y)$ (i.e., $X$ and $Y$ are independent).

4. If $|\mathcal{X}| = m$, and $u$ is the uniform distribution over $\mathcal{X}$, then
$D(p \| u) = \log m - H(p)$.

5. $D(p \| q)$ is convex in the pair $(p, q)$.

Chain rules 

Entropy: $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)$.

Mutual information:

$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_1, X_2, \ldots, X_{i-1})$.

Relative entropy:

$D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))$.

Jensen's inequality. If $f$ is a convex function, then $E f(X) \geq f(EX)$.

Log sum inequality. For $n$ positive numbers, $a_1, a_2, \ldots, a_n$ and
$b_1, b_2, \ldots, b_n$,

$$ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \tag{2.165} $$

with equality if and only if $\frac{a_i}{b_i} = \text{constant}$.

Data-processing inequality. If $X \to Y \to Z$ forms a Markov chain,
$I(X; Y) \geq I(X; Z)$.

Sufficient statistic. $T(X)$ is sufficient relative to $\{f_\theta(x)\}$ if and only
if $I(\theta; X) = I(\theta; T(X))$ for all distributions on $\theta$.

Fano's inequality. Let $P_e = \Pr\{\hat{X}(Y) \neq X\}$. Then

$$ H(P_e) + P_e \log |\mathcal{X}| \geq H(X | Y). \tag{2.166} $$

Inequality. If $X$ and $X'$ are independent and identically distributed,
then

$$ \Pr(X = X') \geq 2^{-H(X)}. \tag{2.167} $$


PROBLEMS 

2.1 Coin flips. A fair coin is flipped until the first head occurs. Let 
X denote the number of flips required. 

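(a) Find the entropy $H(X)$ in bits. The following expressions may
be useful:

$$ \sum_{n=0}^{\infty} r^n = \frac{1}{1 - r}, \qquad \sum_{n=0}^{\infty} n r^n = \frac{r}{(1 - r)^2}. $$
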
(b) A random variable $X$ is drawn according to this distribution.
Find an "efficient" sequence of yes-no questions of the form,
"Is $X$ contained in the set $S$?" Compare $H(X)$ to the expected
number of questions required to determine $X$.

2.2 Entropy of functions. Let $X$ be a random variable taking on a
finite number of values. What is the (general) inequality
relationship of $H(X)$ and $H(Y)$ if

(a) $Y = 2^X$?

(b) $Y = \cos X$?

2.3 Minimum entropy. What is the minimum value of
$H(p_1, \ldots, p_n) = H(\mathbf{p})$ as $\mathbf{p}$ ranges over the set of n-dimensional
probability vectors? Find all $\mathbf{p}$'s that achieve this minimum.

2.4 Entropy of functions of a random variable. Let $X$ be a discrete
random variable. Show that the entropy of a function of $X$ is less
than or equal to the entropy of $X$ by justifying the following steps:

$$ H(X, g(X)) \overset{(a)}{=} H(X) + H(g(X) | X) \tag{2.168} $$

$$ \overset{(b)}{=} H(X), \tag{2.169} $$

$$ H(X, g(X)) \overset{(c)}{=} H(g(X)) + H(X | g(X)) \tag{2.170} $$

$$ \overset{(d)}{\geq} H(g(X)). \tag{2.171} $$

Thus, $H(g(X)) \leq H(X)$.

2.5 Zero conditional entropy. Show that if $H(Y | X) = 0$, then $Y$ is
a function of $X$ [i.e., for all $x$ with $p(x) > 0$, there is only one
possible value of $y$ with $p(x, y) > 0$].

2.6 Conditional mutual information vs. unconditional mutual
information. Give examples of joint random variables $X$, $Y$, and $Z$
such that

(a) $I(X; Y | Z) < I(X; Y)$.

(b) $I(X; Y | Z) > I(X; Y)$.

2.7 Coin weighing. Suppose that one has n coins, among which there 
may or may not be one counterfeit coin. If there is a counterfeit 
coin, it may be either heavier or lighter than the other coins. The 
coins are to be weighed by a balance. 

(a) Find an upper bound on the number of coins n so that k 
weighings will find the counterfeit coin (if any) and correctly 
declare it to be heavier or lighter. 
(b) (Difficult) What is the coin-weighing strategy for $k = 3$
weighings and 12 coins?

2.8 Drawing with and without replacement. An urn contains $r$ red, $w$
white, and $b$ black balls. Which has higher entropy, drawing $k \geq 2$
balls from the urn with replacement or without replacement? Set it
up and show why. (There is both a difficult way and a relatively
simple way to do this.)

2.9 Metric. A function $\rho(x, y)$ is a metric if for all $x$, $y$,

• $\rho(x, y) \geq 0$.

• $\rho(x, y) = \rho(y, x)$.

• $\rho(x, y) = 0$ if and only if $x = y$.

• $\rho(x, y) + \rho(y, z) \geq \rho(x, z)$.

(a) Show that $\rho(X, Y) = H(X | Y) + H(Y | X)$ satisfies the first,
second, and fourth properties above. If we say that $X = Y$ if
there is a one-to-one function mapping from $X$ to $Y$, the third
property is also satisfied, and $\rho(X, Y)$ is a metric.

(b) Verify that $\rho(X, Y)$ can also be expressed as

$$ \rho(X, Y) = H(X) + H(Y) - 2 I(X; Y) \tag{2.172} $$

$$ = H(X, Y) - I(X; Y) \tag{2.173} $$

$$ = 2 H(X, Y) - H(X) - H(Y). \tag{2.174} $$

2.10 Entropy of a disjoint mixture. Let $X_1$ and $X_2$ be discrete random
variables drawn according to probability mass functions $p_1(\cdot)$ and
$p_2(\cdot)$ over the respective alphabets $\mathcal{X}_1 = \{1, 2, \ldots, m\}$ and
$\mathcal{X}_2 = \{m + 1, \ldots, n\}$. Let

$$X = \begin{cases} X_1 & \text{with probability } \alpha, \\ X_2 & \text{with probability } 1 - \alpha. \end{cases}$$

(a) Find $H(X)$ in terms of $H(X_1)$, $H(X_2)$, and $\alpha$.

(b) Maximize over $\alpha$ to show that $2^{H(X)} \leq 2^{H(X_1)} + 2^{H(X_2)}$ and
interpret using the notion that $2^{H(X)}$ is the effective
alphabet size.

2.11 Measure of correlation. Let $X_1$ and $X_2$ be identically distributed
but not necessarily independent. Let

$$\rho = 1 - \frac{H(X_2 \mid X_1)}{H(X_1)}.$$

(a) Show that $\rho = \frac{I(X_1; X_2)}{H(X_1)}$.

(b) Show that $0 \leq \rho \leq 1$.

(c) When is $\rho = 0$?

(d) When is $\rho = 1$?

2.12 Example of joint entropy. Let p(x, y) be given by 



Find: 

(a) $H(X)$, $H(Y)$.

(b) $H(X \mid Y)$, $H(Y \mid X)$.

(c) $H(X, Y)$.

(d) $H(Y) - H(Y \mid X)$.

(e) $I(X; Y)$.

(f) Draw a Venn diagram for the quantities in parts (a) through (e). 

2.13 Inequality. Show that $\ln x \geq 1 - \frac{1}{x}$ for $x > 0$.

2.14 Entropy of a sum. Let X and Y be random variables that take
on values $x_1, x_2, \ldots, x_r$ and $y_1, y_2, \ldots, y_s$, respectively. Let
$Z = X + Y$.

(a) Show that $H(Z \mid X) = H(Y \mid X)$. Argue that if X, Y are
independent, then $H(Y) \leq H(Z)$ and $H(X) \leq H(Z)$. Thus, the
addition of independent random variables adds uncertainty.

(b) Give an example of (necessarily dependent) random variables
in which $H(X) > H(Z)$ and $H(Y) > H(Z)$.

(c) Under what conditions does $H(Z) = H(X) + H(Y)$?

2.15 Data processing. Let $X_1 \to X_2 \to X_3 \to \cdots \to X_n$ form a
Markov chain in this order; that is, let

$$p(x_1, x_2, \ldots, x_n) = p(x_1) p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1}).$$

Reduce $I(X_1; X_2, \ldots, X_n)$ to its simplest form.

2.16 Bottleneck. Suppose that a (nonstationary) Markov chain starts
in one of n states, necks down to k < n states, and then
fans back to m > k states. Thus, $X_1 \to X_2 \to X_3$, that is,
$p(x_1, x_2, x_3) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_2)$, for all $x_1 \in \{1, 2, \ldots, n\}$,
$x_2 \in \{1, 2, \ldots, k\}$, $x_3 \in \{1, 2, \ldots, m\}$.

(a) Show that the dependence of $X_1$ and $X_3$ is limited by the
bottleneck by proving that $I(X_1; X_3) \leq \log k$.

(b) Evaluate $I(X_1; X_3)$ for k = 1, and conclude that no
dependence can survive such a bottleneck.

2.17 Pure randomness and bent coins. Let $X_1, X_2, \ldots, X_n$ denote the
outcomes of independent flips of a bent coin. Thus, $\Pr\{X_i = 1\} = p$,
$\Pr\{X_i = 0\} = 1 - p$, where p is unknown. We wish
to obtain a sequence $Z_1, Z_2, \ldots, Z_K$ of fair coin flips from
$X_1, X_2, \ldots, X_n$. Toward this end, let $f : \mathcal{X}^n \to \{0, 1\}^*$ (where
$\{0, 1\}^* = \{\Lambda, 0, 1, 00, 01, \ldots\}$ is the set of all finite-length binary
sequences) be a mapping $f(X_1, X_2, \ldots, X_n) = (Z_1, Z_2, \ldots, Z_K)$,
where $Z_i \sim$ Bernoulli($\frac{1}{2}$), and K may depend on $(X_1, \ldots, X_n)$.
In order that the sequence $Z_1, Z_2, \ldots$ appear to be fair coin flips,
the map f from bent coin flips to fair flips must have the
property that all $2^k$ sequences $(Z_1, Z_2, \ldots, Z_k)$ of a given length k
have equal probability (possibly 0), for k = 1, 2, .... For example,
for n = 2, the map $f(01) = 0$, $f(10) = 1$, $f(00) = f(11) = \Lambda$
(the null string) has the property that $\Pr\{Z_1 = 1 \mid K = 1\} = \Pr\{Z_1 = 0 \mid K = 1\} = \frac{1}{2}$.
Give reasons for the following inequalities:

$$nH(p) \overset{(a)}{=} H(X_1, \ldots, X_n)$$
$$\overset{(b)}{\geq} H(Z_1, Z_2, \ldots, Z_K, K)$$
$$\overset{(c)}{=} H(K) + H(Z_1, \ldots, Z_K \mid K)$$
$$\overset{(d)}{=} H(K) + E(K)$$
$$\overset{(e)}{\geq} EK.$$

Thus, no more than $nH(p)$ fair coin tosses can be derived from
$(X_1, \ldots, X_n)$, on the average. Exhibit a good map f on sequences
of length 4.

2.18 World Series. The World Series is a seven-game series that
terminates as soon as either team wins four games. Let X be the random
variable that represents the outcome of a World Series between
teams A and B; possible values of X are AAAA, BABABAB, and
BBBAAAA. Let Y be the number of games played, which ranges
from 4 to 7. Assuming that A and B are equally matched and that
the games are independent, calculate $H(X)$, $H(Y)$, $H(Y \mid X)$, and
$H(X \mid Y)$.

2.19 Infinite entropy. This problem shows that the entropy of a discrete
random variable can be infinite. Let $A = \sum_{n=2}^{\infty} (n \log^2 n)^{-1}$. [It is
easy to show that A is finite by bounding the infinite sum by the
integral of $(x \log^2 x)^{-1}$.] Show that the integer-valued random
variable X defined by $\Pr(X = n) = (A n \log^2 n)^{-1}$ for n = 2, 3, ...,
has $H(X) = +\infty$.

2.20 Run-length coding. Let $X_1, X_2, \ldots, X_n$ be (possibly dependent)
binary random variables. Suppose that one calculates the run
lengths $\mathbf{R} = (R_1, R_2, \ldots)$ of this sequence (in order as they
occur). For example, the sequence $\mathbf{X} = 0001100100$ yields run
lengths $\mathbf{R} = (3, 2, 2, 1, 2)$. Compare $H(X_1, X_2, \ldots, X_n)$, $H(\mathbf{R})$,
and $H(X_n, \mathbf{R})$. Show all equalities and inequalities, and bound all
the differences.

2.21 Markov's inequality for probabilities. Let p(x) be a probability
mass function. Prove, for all $d \geq 0$, that

$$\Pr\{p(X) \leq d\} \log \frac{1}{d} \leq H(X). \tag{2.175}$$

2.22 Logical order of ideas. Ideas have been developed in order of
need and then generalized if necessary. Reorder the following ideas,
strongest first, implications following:

(a) Chain rule for $I(X_1, \ldots, X_n; Y)$, chain rule for
$D(p(x_1, \ldots, x_n) \| q(x_1, x_2, \ldots, x_n))$, and chain rule for $H(X_1, X_2, \ldots, X_n)$.

(b) $D(f \| g) \geq 0$, Jensen's inequality, $I(X; Y) \geq 0$.

2.23 Conditional mutual information. Consider a sequence of n binary
random variables $X_1, X_2, \ldots, X_n$. Each sequence with an even
number of 1's has probability $2^{-(n-1)}$, and each sequence with an
odd number of 1's has probability 0. Find the mutual informations

$$I(X_1; X_2),\ I(X_2; X_3 \mid X_1),\ \ldots,\ I(X_{n-1}; X_n \mid X_1, \ldots, X_{n-2}).$$

2.24 Average entropy. Let $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
be the binary entropy function.

(a) Evaluate $H(\frac{1}{4})$ using the fact that $\log_2 3 \approx 1.584$. (Hint: You
may wish to consider an experiment with four equally likely
outcomes, one of which is more interesting than the others.)

(b) Calculate the average entropy $H(p)$ when the probability p is
chosen uniformly in the range $0 \leq p \leq 1$.

(c) (Optional) Calculate the average entropy $H(p_1, p_2, p_3)$, where
$(p_1, p_2, p_3)$ is a uniformly distributed probability vector.
Generalize to dimension n.

2.25 Venn diagrams. There isn't really a notion of mutual information
common to three random variables. Here is one attempt at a
definition: Using Venn diagrams, we can see that the mutual information
common to three random variables X, Y, and Z can be defined by

$$I(X; Y; Z) = I(X; Y) - I(X; Y \mid Z).$$

This quantity is symmetric in X, Y, and Z, despite the preceding
asymmetric definition. Unfortunately, $I(X; Y; Z)$ is not necessarily
nonnegative. Find X, Y, and Z such that $I(X; Y; Z) < 0$, and
prove the following two identities:

(a) $I(X; Y; Z) = H(X, Y, Z) - H(X) - H(Y) - H(Z) + I(X; Y) + I(Y; Z) + I(Z; X)$.

(b) $I(X; Y; Z) = H(X, Y, Z) - H(X, Y) - H(Y, Z) - H(Z, X) + H(X) + H(Y) + H(Z)$.

The first identity can be understood using the Venn diagram analogy
for entropy and mutual information. The second identity follows
easily from the first.

2.26 Another proof of nonnegativity of relative entropy. In view of the
fundamental nature of the result $D(p \| q) \geq 0$, we will give another
proof.

(a) Show that $\ln x \leq x - 1$ for $0 < x < \infty$.

(b) Justify the following steps:

$$-D(p \| q) = \sum_x p(x) \ln \frac{q(x)}{p(x)} \tag{2.176}$$
$$\leq \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right) \tag{2.177}$$
$$\leq 0. \tag{2.178}$$

(c) What are the conditions for equality?

2.27 Grouping rule for entropy. Let $\mathbf{p} = (p_1, p_2, \ldots, p_m)$ be a
probability distribution on m elements (i.e., $p_i \geq 0$ and $\sum_{i=1}^{m} p_i = 1$).
Define a new distribution $\mathbf{q}$ on m - 1 elements as $q_1 = p_1$, $q_2 = p_2$,
..., $q_{m-2} = p_{m-2}$, and $q_{m-1} = p_{m-1} + p_m$ [i.e., the distribution $\mathbf{q}$
is the same as $\mathbf{p}$ on {1, 2, ..., m - 2}, and the probability of the
last element in $\mathbf{q}$ is the sum of the last two probabilities of $\mathbf{p}$].
Show that

$$H(\mathbf{p}) = H(\mathbf{q}) + (p_{m-1} + p_m) H\left( \frac{p_{m-1}}{p_{m-1} + p_m}, \frac{p_m}{p_{m-1} + p_m} \right). \tag{2.179}$$

2.28 Mixing increases entropy. Show that the entropy of the
probability distribution $(p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m)$ is less than the
entropy of the distribution $(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_i + p_j}{2}, \ldots, p_m)$.
Show that in general any transfer of probability that
makes the distribution more uniform increases the entropy.

2.29 Inequalities. Let X, Y, and Z be joint random variables. Prove
the following inequalities and find conditions for equality.

(a) $H(X, Y \mid Z) \geq H(X \mid Z)$.

(b) $I(X, Y; Z) \geq I(X; Z)$.

(c) $H(X, Y, Z) - H(X, Y) \leq H(X, Z) - H(X)$.

(d) $I(X; Z \mid Y) \geq I(Z; Y \mid X) - I(Z; Y) + I(X; Z)$.

2.30 Maximum entropy. Find the probability mass function p(x) that
maximizes the entropy $H(X)$ of a nonnegative integer-valued
random variable X subject to the constraint

$$EX = \sum_{n=0}^{\infty} n\, p(n) = A$$

for a fixed value A > 0. Evaluate this maximum $H(X)$.

2.31 Conditional entropy. Under what conditions does
$H(X \mid g(Y)) = H(X \mid Y)$?

2.32 Fano. We are given the following joint distribution on (X, Y):

              Y
    X       a       b       c
    1      1/6     1/12    1/12
    2      1/12    1/6     1/12
    3      1/12    1/12    1/6

Let $\hat{X}(Y)$ be an estimator for X (based on Y) and let
$P_e = \Pr\{\hat{X}(Y) \neq X\}$.

(a) Find the minimum probability of error estimator $\hat{X}(Y)$ and the
associated $P_e$.

(b) Evaluate Fano's inequality for this problem and compare.

2.33 Fano's inequality. Let $\Pr(X = i) = p_i$, i = 1, 2, ..., m, and let
$p_1 \geq p_2 \geq p_3 \geq \cdots \geq p_m$. The minimal probability of error
predictor of X is $\hat{X} = 1$, with resulting probability of error
$P_e = 1 - p_1$. Maximize $H(\mathbf{p})$ subject to the constraint $1 - p_1 = P_e$ to
find a bound on $P_e$ in terms of H. This is Fano's inequality in the
absence of conditioning.

2.34 Entropy of initial conditions. Prove that $H(X_0 \mid X_n)$ is
nondecreasing with n for any Markov chain.

2.35 Relative entropy is not symmetric.
Let the random variable X have three possible outcomes {a, b, c}.
Consider two distributions on this random variable:

    Symbol    p(x)    q(x)
    a         1/2     1/3
    b         1/4     1/3
    c         1/4     1/3

Calculate $H(p)$, $H(q)$, $D(p \| q)$, and $D(q \| p)$. Verify that in this
case, $D(p \| q) \neq D(q \| p)$.

2.36 Symmetric relative entropy. Although, as Problem 2.35 shows,
$D(p \| q) \neq D(q \| p)$ in general, there could be distributions for
which equality holds. Give an example of two distributions p and
q on a binary alphabet such that $D(p \| q) = D(q \| p)$ (other than
the trivial case p = q).

2.37 Relative entropy. Let X, Y, Z be three random variables with a
joint probability mass function p(x, y, z). The relative entropy
between the joint distribution and the product of the marginals is

$$D(p(x, y, z) \| p(x) p(y) p(z)) = E\left[ \log \frac{p(x, y, z)}{p(x) p(y) p(z)} \right]. \tag{2.180}$$

Expand this in terms of entropies. When is this quantity zero?

2.38 The value of a question. Let X ~ p(x), x = 1, 2, ..., m. We are
given a set $S \subseteq \{1, 2, \ldots, m\}$. We ask whether $X \in S$ and receive
the answer

$$Y = \begin{cases} 1 & \text{if } X \in S \\ 0 & \text{if } X \notin S. \end{cases}$$

Suppose that $\Pr\{X \in S\} = \alpha$. Find the decrease in uncertainty
$H(X) - H(X \mid Y)$.

Apparently, any set S with a given $\alpha$ is as good as any other.

2.39 Entropy and pairwise independence. Let X, Y, Z be three binary
Bernoulli($\frac{1}{2}$) random variables that are pairwise independent; that
is, $I(X; Y) = I(X; Z) = I(Y; Z) = 0$.

(a) Under this constraint, what is the minimum value for
$H(X, Y, Z)$?

(b) Give an example achieving this minimum.

2.40 Discrete entropies. Let X and Y be two independent
integer-valued random variables. Let X be uniformly distributed over
{1, 2, ..., 8}, and let $\Pr\{Y = k\} = 2^{-k}$, k = 1, 2, 3, ....

(a) Find $H(X)$.

(b) Find $H(Y)$.

(c) Find $H(X + Y, X - Y)$.

2.41 Random questions. One wishes to identify a random object X ~
p(x). A question Q ~ r(q) is asked at random according to r(q).
This results in a deterministic answer $A = A(x, q) \in \{a_1, a_2, \ldots\}$.
Suppose that X and Q are independent. Then $I(X; Q, A)$ is the
uncertainty in X removed by the question-answer (Q, A).

(a) Show that $I(X; Q, A) = H(A \mid Q)$. Interpret.

(b) Now suppose that two i.i.d. questions $Q_1, Q_2 \sim r(q)$ are
asked, eliciting answers $A_1$ and $A_2$. Show that two questions
are less valuable than twice a single question in the sense that
$I(X; Q_1, A_1, Q_2, A_2) \leq 2 I(X; Q_1, A_1)$.

2.42 Inequalities. Which of the following inequalities are generally
$\geq$, $=$, $\leq$? Label each with $\geq$, $=$, or $\leq$.

(a) $H(5X)$ vs. $H(X)$

(b) $I(g(X); Y)$ vs. $I(X; Y)$

(c) $H(X_0 \mid X_{-1})$ vs. $H(X_0 \mid X_{-1}, X_1)$

(d) $H(X, Y) / (H(X) + H(Y))$ vs. 1

2.43 Mutual information of heads and tails

(a) Consider a fair coin flip. What is the mutual information 
between the top and bottom sides of the coin? 

(b) A six-sided fair die is rolled. What is the mutual information 
between the top side and the front face (the side most facing 
you)? 

2.44 Pure randomness. We wish to use a three-sided coin to generate
a fair coin toss. Let the coin X have probability mass function

$$X = \begin{cases} A, & p_A \\ B, & p_B \\ C, & p_C, \end{cases}$$

where $p_A$, $p_B$, $p_C$ are unknown.

(a) How would you use two independent flips $X_1$, $X_2$ to generate
(if possible) a Bernoulli($\frac{1}{2}$) random variable Z?

(b) What is the resulting maximum expected number of fair bits 
generated? 

2.45 Finite entropy. Show that for a discrete random variable
$X \in \{1, 2, \ldots\}$, if $E \log X < \infty$, then $H(X) < \infty$.

2.46 Axiomatic definition of entropy (Difficult). If we assume certain
axioms for our measure of information, we will be forced to use a
logarithmic measure such as entropy. Shannon used this to justify
his initial definition of entropy. In this book we rely more on the
other properties of entropy rather than its axiomatic derivation to
justify its use. The following problem is considerably more difficult
than the other problems in this section.

If a sequence of symmetric functions $H_m(p_1, p_2, \ldots, p_m)$ satisfies
the following properties:

• Normalization: $H_2\left(\frac{1}{2}, \frac{1}{2}\right) = 1$,

• Continuity: $H_2(p, 1 - p)$ is a continuous function of p,

• Grouping: $H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2) H_2\left( \frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2} \right)$,

prove that $H_m$ must be of the form

$$H_m(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i, \quad m = 2, 3, \ldots. \tag{2.181}$$

There are various other axiomatic formulations which result in the
same definition of entropy. See, for example, the book by Csiszár
and Körner [149].

2.47 Entropy of a missorted file. A deck of n cards in order 1, 2, ..., n
is provided. One card is removed at random, then replaced at
random. What is the entropy of the resulting deck?


2.48 Sequence length. How much information does the length of a
sequence give about the content of a sequence? Suppose that we
consider a Bernoulli($\frac{1}{2}$) process $\{X_i\}$. Stop the process when the
first 1 appears. Let N designate this stopping time. Thus, $X^N$ is an
element of the set of all finite-length binary sequences
$\{0, 1\}^* = \{0, 1, 00, 01, 10, 11, 000, \ldots\}$.

(a) Find $I(N; X^N)$.

(b) Find $H(X^N \mid N)$.

(c) Find $H(X^N)$.

Let's now consider a different stopping time. For this part, again
assume that $X_i \sim$ Bernoulli($\frac{1}{2}$) but stop at time N = 6 with
probability $\frac{1}{3}$ and stop at time N = 12 with probability $\frac{2}{3}$. Let this
stopping time be independent of the sequence $X_1 X_2 \cdots X_{12}$.

(d) Find $I(N; X^N)$.

(e) Find $H(X^N \mid N)$.

(f) Find $H(X^N)$.

HISTORICAL NOTES 


The concept of entropy was introduced in thermodynamics, where it
was used to provide a statement of the second law of thermodynamics.
Later, statistical mechanics provided a connection between
thermodynamic entropy and the logarithm of the number of microstates in a
macrostate of the system. This work was the crowning achievement of
Boltzmann, who had the equation $S = k \ln W$ inscribed as the epitaph on
his gravestone [361].

In the 1930s, Hartley introduced a logarithmic measure of
information for communication. His measure was essentially the logarithm of the
alphabet size. Shannon [472] was the first to define entropy and mutual
information as defined in this chapter. Relative entropy was first defined
by Kullback and Leibler [339]. It is known under a variety of names,
including the Kullback-Leibler distance, cross entropy, information
divergence, and information for discrimination, and has been studied in detail
by Csiszár [138] and Amari [22].
Many of the simple properties of these quantities were developed by
Shannon. Fano's inequality was proved in Fano [201]. The notion of
sufficient statistic was defined by Fisher [209], and the notion of the
minimal sufficient statistic was introduced by Lehmann and Scheffé [350].
The relationship of mutual information and sufficiency is due to Kullback
[335]. The relationship between information theory and thermodynamics
has been discussed extensively by Brillouin [77] and Jaynes [294].

The physics of information is a vast new subject of inquiry spawned
from statistical mechanics, quantum mechanics, and information theory.
The key question is how information is represented physically.
Quantum channel capacity (the logarithm of the number of distinguishable
preparations of a physical system) and quantum data compression [299]
are well-defined problems with nice answers involving the von Neumann
entropy. A new element of quantum information arises from the
existence of quantum entanglement and the consequence (exhibited in Bell's
inequality) that the observed marginal distributions of physical events are
not consistent with any joint distribution (no local realism). The
fundamental text by Nielsen and Chuang [395] develops the theory of quantum
information and the quantum counterparts to many of the results in this
book. There have also been attempts to determine whether there are
any fundamental physical limits to computation, including work by
Bennett [47] and Bennett and Landauer [48].



CHAPTER 3 


ASYMPTOTIC EQUIPARTITION 
PROPERTY 


In information theory, the analog of the law of large numbers is the
asymptotic equipartition property (AEP). It is a direct consequence
of the weak law of large numbers. The law of large numbers states
that for independent, identically distributed (i.i.d.) random variables,
$\frac{1}{n} \sum_{i=1}^{n} X_i$ is close to its expected value EX for large values of n.
The AEP states that $\frac{1}{n} \log \frac{1}{p(X_1, X_2, \ldots, X_n)}$ is close to the entropy H, where
$X_1, X_2, \ldots, X_n$ are i.i.d. random variables and $p(X_1, X_2, \ldots, X_n)$ is the
probability of observing the sequence $X_1, X_2, \ldots, X_n$. Thus, the
probability $p(X_1, X_2, \ldots, X_n)$ assigned to an observed sequence will be close
to $2^{-nH}$.

This enables us to divide the set of all sequences into two sets, the 
typical set, where the sample entropy is close to the true entropy, and the 
nontypical set, which contains the other sequences. Most of our attention 
will be on the typical sequences. Any property that is proved for the typical 
sequences will then be true with high probability and will determine the 
average behavior of a large sample. 

First, an example. Let the random variable $X \in \{0, 1\}$ have a probability
mass function defined by p(1) = p and p(0) = q. If $X_1, X_2, \ldots, X_n$ are
i.i.d. according to p(x), the probability of a sequence $x_1, x_2, \ldots, x_n$ is
$\prod_{i=1}^{n} p(x_i)$. For example, the probability of the sequence (1, 0, 1, 1, 0, 1)
is $p^{\sum x_i} q^{n - \sum x_i} = p^4 q^2$. Clearly, it is not true that all $2^n$ sequences of
length n have the same probability.
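
As a quick numerical check of this example, the following Python sketch computes the probability of the sequence (1, 0, 1, 1, 0, 1) both as a product of per-symbol probabilities and through the closed form $p^{\sum x_i} q^{n - \sum x_i}$. The value p = 0.6, the assumption q = 1 - p, and the helper name `sequence_probability` are illustrative choices, not part of the text.

```python
from math import prod

def sequence_probability(seq, p):
    """Probability of a binary sequence under i.i.d. draws with P(1) = p."""
    q = 1.0 - p  # assumption: p(0) = q = 1 - p
    return prod(p if x == 1 else q for x in seq)

seq = (1, 0, 1, 1, 0, 1)
p = 0.6                               # illustrative value, not from the text
k = sum(seq)                          # number of 1's in the sequence
direct = sequence_probability(seq, p)
closed_form = p**k * (1 - p)**(len(seq) - k)   # p^4 q^2 for this sequence

print(k, direct, closed_form)
```

Any other sequence with four 1's and two 0's receives the same probability, while sequences with a different number of 1's generally do not.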

However, we might be able to predict the probability of the sequence
that we actually observe. We ask for the probability $p(X_1, X_2, \ldots, X_n)$ of
the outcomes $X_1, X_2, \ldots, X_n$, where $X_1, X_2, \ldots$ are i.i.d. ~ p(x). This is
insidiously self-referential, but well defined nonetheless. Apparently, we
are asking for the probability of an event drawn according to the same
probability distribution. Here it turns out that $p(X_1, X_2, \ldots, X_n)$ is close
to $2^{-nH}$ with high probability.

We summarize this by saying, “Almost all events are almost equally
surprising.” This is a way of saying that

$$\Pr\left\{ (X_1, X_2, \ldots, X_n) : p(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)} \right\} \approx 1 \tag{3.1}$$

if $X_1, X_2, \ldots, X_n$ are i.i.d. ~ p(x).

In the example just given, where $p(X_1, X_2, \ldots, X_n) = p^{\sum X_i} q^{n - \sum X_i}$,
we are simply saying that the number of 1's in the sequence is close
to np (with high probability), and all such sequences have (roughly) the
same probability $2^{-nH(p)}$. We use the idea of convergence in probability,
defined as follows:

Definition (Convergence of random variables). Given a sequence of
random variables, $X_1, X_2, \ldots$, we say that the sequence $X_1, X_2, \ldots$
converges to a random variable X:

1. In probability if for every $\epsilon > 0$, $\Pr\{|X_n - X| > \epsilon\} \to 0$

2. In mean square if $E(X_n - X)^2 \to 0$

3. With probability 1 (also called almost surely) if $\Pr\{\lim_{n \to \infty} X_n = X\} = 1$


3.1 ASYMPTOTIC EQUIPARTITION PROPERTY THEOREM 

The asymptotic equipartition property is formalized in the following 
theorem. 


Theorem 3.1.1 (AEP) If $X_1, X_2, \ldots$ are i.i.d. ~ p(x), then

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability.} \tag{3.2}$$

Proof: Functions of independent random variables are also independent
random variables. Thus, since the $X_i$ are i.i.d., so are $\log p(X_i)$. Hence,
by the weak law of large numbers,

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_i \log p(X_i) \tag{3.3}$$
$$\to -E \log p(X) \quad \text{in probability} \tag{3.4}$$
$$= H(X), \tag{3.5}$$

which proves the theorem. □
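
The convergence in Theorem 3.1.1 can be observed numerically. The sketch below draws one Bernoulli(p) sequence for each of several lengths n and prints the sample entropy $-\frac{1}{n} \log_2 p(X_1, \ldots, X_n)$ next to H(X). The choice p = 0.3, the seed, the sequence lengths, and the function names are assumptions made for illustration only.

```python
import math
import random

def entropy_bits(p):
    """Binary entropy H(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sample_entropy(n, p, rng):
    """Return -(1/n) log2 p(X_1..X_n) for one random Bernoulli(p) sequence."""
    ones = sum(1 for _ in range(n) if rng.random() < p)
    log_prob = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
    return -log_prob / n

rng = random.Random(0)      # fixed seed so that runs are repeatable
p = 0.3                     # illustrative parameter
H = entropy_bits(p)
for n in (10, 100, 10_000, 100_000):
    print(n, round(sample_entropy(n, p, rng), 4), "H =", round(H, 4))
```

For small n the sample entropy can be far from H; the theorem only asserts that large deviations become improbable as n grows.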

Definition The typical set $A_\epsilon^{(n)}$ with respect to p(x) is the set of
sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the property

$$2^{-n(H(X) + \epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X) - \epsilon)}. \tag{3.6}$$

As a consequence of the AEP, we can show that the set $A_\epsilon^{(n)}$ has the
following properties:

Theorem 3.1.2

1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then
$H(X) - \epsilon \leq -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \leq H(X) + \epsilon$.

2. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for n sufficiently large.

3. $|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}$, where |A| denotes the number of elements in the
set A.

4. $|A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X) - \epsilon)}$ for n sufficiently large.

Thus, the typical set has probability nearly 1, all elements of the typical
set are nearly equiprobable, and the number of elements in the typical set
is nearly $2^{nH}$.
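
For a Bernoulli(p) source these properties can be checked exactly, because every sequence with k ones has the same probability and there are $\binom{n}{k}$ of them. The sketch below is one way to do this; the parameters p = 0.3 and $\epsilon$ = 0.1, the lengths n, and the function names are illustrative assumptions.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_stats(n, p, eps):
    """Exact size and probability of the typical set for i.i.d. Bernoulli(p)."""
    H = entropy_bits(p)
    size, prob = 0, 0.0
    for k in range(n + 1):                       # k = number of 1's
        log2_px = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_px / n - H) <= eps:         # membership test from (3.6)
            count = math.comb(n, k)              # sequences with exactly k ones
            size += count
            prob += count * 2.0 ** log2_px
    return size, prob

p, eps = 0.3, 0.1            # illustrative parameters
H = entropy_bits(p)
for n in (25, 100, 400):
    size, prob = typical_set_stats(n, p, eps)
    print(n, round(prob, 4), round(math.log2(size) / n, 4), "vs H =", round(H, 4))
```

The printed probability should approach 1 as n grows, while $\frac{1}{n} \log_2 |A_\epsilon^{(n)}|$ stays within $\epsilon$ of H, as properties (2) through (4) state.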


Proof: The proof of property (1) is immediate from the definition of
$A_\epsilon^{(n)}$. The second property follows directly from Theorem 3.1.1, since the
probability of the event $(X_1, X_2, \ldots, X_n) \in A_\epsilon^{(n)}$ tends to 1 as $n \to \infty$.
Thus, for any $\delta > 0$, there exists an $n_0$ such that for all $n \geq n_0$, we have

$$\Pr\left\{ \left| -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X) \right| < \epsilon \right\} > 1 - \delta. \tag{3.7}$$

Setting $\delta = \epsilon$, we obtain the second part of the theorem. The identification
of $\delta = \epsilon$ will conveniently simplify notation later.

To prove property (3), we write

$$1 = \sum_{\mathbf{x} \in \mathcal{X}^n} p(\mathbf{x}) \tag{3.8}$$
$$\geq \sum_{\mathbf{x} \in A_\epsilon^{(n)}} p(\mathbf{x}) \tag{3.9}$$
$$\geq \sum_{\mathbf{x} \in A_\epsilon^{(n)}} 2^{-n(H(X) + \epsilon)} \tag{3.10}$$
$$= 2^{-n(H(X) + \epsilon)} |A_\epsilon^{(n)}|, \tag{3.11}$$

where the second inequality follows from (3.6). Hence

$$|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}. \tag{3.12}$$

Finally, for sufficiently large n, $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$, so that

$$1 - \epsilon < \Pr\{A_\epsilon^{(n)}\} \tag{3.13}$$
$$\leq \sum_{\mathbf{x} \in A_\epsilon^{(n)}} 2^{-n(H(X) - \epsilon)} \tag{3.14}$$
$$= 2^{-n(H(X) - \epsilon)} |A_\epsilon^{(n)}|, \tag{3.15}$$

where the second inequality follows from (3.6). Hence,

$$|A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X) - \epsilon)}, \tag{3.16}$$

which completes the proof of the properties of $A_\epsilon^{(n)}$. □

3.2 CONSEQUENCES OF THE AEP: DATA COMPRESSION 

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random
variables drawn from the probability mass function p(x). We wish to find
short descriptions for such sequences of random variables. We divide all
sequences in $\mathcal{X}^n$ into two sets: the typical set $A_\epsilon^{(n)}$ and its complement,
as shown in Figure 3.1.

FIGURE 3.1. Typical sets and source coding.

We order all elements in each set according to some order (e.g.,
lexicographic order). Then we can represent each sequence of $A_\epsilon^{(n)}$ by giving
the index of the sequence in the set. Since there are $\leq 2^{n(H + \epsilon)}$ sequences
in $A_\epsilon^{(n)}$, the indexing requires no more than $n(H + \epsilon) + 1$ bits. [The extra
bit may be necessary because $n(H + \epsilon)$ may not be an integer.] We
prefix all these sequences by a 0, giving a total length of $\leq n(H + \epsilon) + 2$
bits to represent each sequence in $A_\epsilon^{(n)}$ (see Figure 3.2). Similarly, we can
index each sequence not in $A_\epsilon^{(n)}$ by using not more than $n \log |\mathcal{X}| + 1$ bits.
Prefixing these indices by 1, we have a code for all the sequences in $\mathcal{X}^n$.

Note the following features of the above coding scheme:

• The code is one-to-one and easily decodable. The initial bit acts as
a flag bit to indicate the length of the codeword that follows.

• We have used a brute-force enumeration of the atypical set $A_\epsilon^{(n)c}$
without taking into account the fact that the number of elements in
$A_\epsilon^{(n)c}$ is less than the number of elements in $\mathcal{X}^n$. Surprisingly, this is
good enough to yield an efficient description.

• The typical sequences have short descriptions of length $\approx nH$.
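
The coding scheme described above can be sketched for a small binary source by brute-force enumeration. The following Python code is a minimal illustration, assuming a Bernoulli(p) source with n = 12, p = 0.2, and $\epsilon$ = 0.15; these values and the name `build_code` are hypothetical choices, and the enumeration is only practical for small n.

```python
import itertools
import math

def build_code(n, p, eps):
    """Two-part code for binary i.i.d. Bernoulli(p) sequences of length n."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    typical, atypical = [], []
    for seq in itertools.product((0, 1), repeat=n):     # lexicographic order
        k = sum(seq)
        log2_px = k * math.log2(p) + (n - k) * math.log2(1 - p)
        (typical if abs(-log2_px / n - H) <= eps else atypical).append(seq)
    w_typ = math.ceil(n * (H + eps))   # enough bits, since |A| <= 2^{n(H+eps)}
    w_atyp = n                         # n log2|X| bits with |X| = 2
    enc = {}
    for i, seq in enumerate(typical):
        enc[seq] = "0" + format(i, f"0{w_typ}b")    # flag 0 + index in typical set
    for i, seq in enumerate(atypical):
        enc[seq] = "1" + format(i, f"0{w_atyp}b")   # flag 1 + index in complement
    dec = {cw: seq for seq, cw in enc.items()}
    return enc, dec, H

n, p, eps = 12, 0.2, 0.15              # illustrative parameters
enc, dec, H = build_code(n, p, eps)
typical_lengths = {len(cw) for cw in enc.values() if cw[0] == "0"}
print(typical_lengths, "<=", n * (H + eps) + 2)
```

The flag bit tells the decoder which of the two fixed index widths follows, so the code is one-to-one. For a block length this small the flag bit and rounding overhead dominate and no compression should be expected; the guarantee in the text is asymptotic in n.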

We use the notation $x^n$ to denote a sequence $x_1, x_2, \ldots, x_n$. Let $l(x^n)$
be the length of the codeword corresponding to $x^n$. If n is sufficiently
large so that $\Pr\{A_\epsilon^{(n)}\} \geq 1 - \epsilon$, the expected length of the codeword is

$$E(l(X^n)) = \sum_{x^n} p(x^n) l(x^n) \tag{3.17}$$
$$= \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) l(x^n) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n) l(x^n) \tag{3.18}$$
$$\leq \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) (n(H + \epsilon) + 2) + \sum_{x^n \in A_\epsilon^{(n)c}} p(x^n) (n \log |\mathcal{X}| + 2) \tag{3.19}$$
$$= \Pr\{A_\epsilon^{(n)}\} (n(H + \epsilon) + 2) + \Pr\{A_\epsilon^{(n)c}\} (n \log |\mathcal{X}| + 2) \tag{3.20}$$
$$\leq n(H + \epsilon) + \epsilon n (\log |\mathcal{X}|) + 2 \tag{3.21}$$
$$= n(H + \epsilon'), \tag{3.22}$$

where $\epsilon' = \epsilon + \epsilon \log |\mathcal{X}| + \frac{2}{n}$ can be made arbitrarily small by an
appropriate choice of $\epsilon$ followed by an appropriate choice of n. Hence we have
proved the following theorem.

Theorem 3.2.1 Let $X^n$ be i.i.d. ~ p(x). Let $\epsilon > 0$. Then there exists a
code that maps sequences $x^n$ of length n into binary strings such that the
mapping is one-to-one (and therefore invertible) and

$$E\left[ \frac{1}{n} l(X^n) \right] \leq H(X) + \epsilon \tag{3.23}$$

for n sufficiently large.

Thus, we can represent sequences $X^n$ using $nH(X)$ bits on the average.

3.3 HIGH-PROBABILITY SETS AND THE TYPICAL SET 

From the definition of $A_\epsilon^{(n)}$, it is clear that $A_\epsilon^{(n)}$ is a fairly small set that
contains most of the probability. But from the definition, it is not clear
whether it is the smallest such set. We will prove that the typical set has
essentially the same number of elements as the smallest set, to first order
in the exponent.

Definition For each n = 1, 2, ..., let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set
with

$$\Pr\{B_\delta^{(n)}\} \geq 1 - \delta. \tag{3.24}$$

We argue that $B_\delta^{(n)}$ must have significant intersection with $A_\epsilon^{(n)}$ and
therefore must have about as many elements. In Problem 3.3.11, we outline
the proof of the following theorem.

Theorem 3.3.1 Let $X_1, X_2, \ldots, X_n$ be i.i.d. ~ p(x). For $\delta < \frac{1}{2}$ and
any $\delta' > 0$, if $\Pr\{B_\delta^{(n)}\} > 1 - \delta$, then

$$\frac{1}{n} \log |B_\delta^{(n)}| > H - \delta' \quad \text{for n sufficiently large.} \tag{3.25}$$

Thus, $B_\delta^{(n)}$ must have at least $2^{nH}$ elements, to first order in the
exponent. But $A_\epsilon^{(n)}$ has $2^{n(H \pm \epsilon)}$ elements. Therefore, $A_\epsilon^{(n)}$ is about the same
size as the smallest high-probability set.

We will now define some new notation to express equality to first order
in the exponent.

Definition The notation $a_n \doteq b_n$ means

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0. \tag{3.26}$$

Thus, $a_n \doteq b_n$ implies that $a_n$ and $b_n$ are equal to the first order in the
exponent.
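
As a brief worked illustration of this notation (the sequences below are chosen for this example and do not appear in the text), polynomial factors are invisible to first order in the exponent, while a change in the exponential rate is not:

```latex
% Illustrative sequences: a_n = n^2 2^{nH}, b_n = 2^{nH}, c_n = 2^{n(H+1)}.
\[
  \frac{1}{n}\log\frac{a_n}{b_n}
    = \frac{1}{n}\log n^{2}
    = \frac{2\log n}{n} \longrightarrow 0,
  \qquad\text{so } a_n \doteq b_n ,
\]
\[
  \frac{1}{n}\log\frac{c_n}{b_n}
    = \frac{1}{n}\log 2^{n}
    = 1 \not\longrightarrow 0,
  \qquad\text{so } c_n \not\doteq b_n \quad (\text{logarithms to base } 2).
\]
```

In particular, the factor $(1 - \epsilon)$ in property (4) of Theorem 3.1.2 does not affect the exponent.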

We can now restate the above results: If $\delta_n \to 0$ and $\epsilon_n \to 0$, then

$$|B_{\delta_n}^{(n)}| \doteq |A_{\epsilon_n}^{(n)}| \doteq 2^{nH}. \tag{3.27}$$

To illustrate the difference between $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$, let us
consider a Bernoulli sequence $X_1, X_2, \ldots, X_n$ with parameter p = 0.9. [A
Bernoulli($\theta$) random variable is a binary random variable that takes on
the value 1 with probability $\theta$.] The typical sequences in this case are the
sequences in which the proportion of 1's is close to 0.9. However, this
does not include the most likely single sequence, which is the sequence of
all 1's. The set $B_\delta^{(n)}$ includes all the most probable sequences and
therefore includes the sequence of all 1's. Theorem 3.3.1 implies that $A_\epsilon^{(n)}$ and
$B_\delta^{(n)}$ must both contain the sequences that have about 90% 1's, and the
two sets are almost equal in size.
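
The Bernoulli(0.9) illustration can be made concrete. The sketch below checks that the all-1's sequence fails the typicality test and computes the size of the smallest set of probability at least $1 - \delta$ by adding sequences in order of decreasing probability. The values n = 200, $\epsilon$ = 0.1, $\delta$ = 0.1, and the function name are assumptions for illustration; the greedy construction relies on p > 1/2 so that sequences with more 1's are more probable.

```python
import math

def smallest_set_size(n, p, delta):
    """Size of the smallest set with probability >= 1 - delta (assumes p > 1/2)."""
    target, mass, size = 1.0 - delta, 0.0, 0
    for k in range(n, -1, -1):                  # most probable sequences first
        px = p**k * (1 - p)**(n - k)            # probability of one sequence with k ones
        count = math.comb(n, k)                 # number of such sequences
        if mass + count * px >= target:         # only part of this class is needed
            needed = math.ceil((target - mass) / px)
            return size + min(needed, count)
        mass += count * px
        size += count
    return size

n, p, eps, delta = 200, 0.9, 0.1, 0.1           # illustrative parameters
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Sample entropy of the all-1's sequence is -log2(p), independent of n.
all_ones_typical = abs(-math.log2(p) - H) <= eps
b_size = smallest_set_size(n, p, delta)
print(all_ones_typical, round(math.log2(b_size) / n, 3), "vs H =", round(H, 3))
```

By construction the set built here contains the all-1's sequence, while the typicality test rejects it; the exponent $\frac{1}{n} \log_2 |B_\delta^{(n)}|$ approaches H only slowly as n grows.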

SUMMARY

AEP. “Almost all events are almost equally surprising.” Specifically,
if $X_1, X_2, \ldots$ are i.i.d. ~ p(x), then

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability.} \tag{3.28}$$

Definition. The typical set $A_\epsilon^{(n)}$ is the set of sequences $x_1, x_2, \ldots, x_n$
satisfying

$$2^{-n(H(X) + \epsilon)} \leq p(x_1, x_2, \ldots, x_n) \leq 2^{-n(H(X) - \epsilon)}. \tag{3.29}$$

Properties of the typical set

1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then $p(x_1, x_2, \ldots, x_n) = 2^{-n(H \pm \epsilon)}$.

2. $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for n sufficiently large.

3. $|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}$, where |A| denotes the number of elements in
set A.

Definition. $a_n \doteq b_n$ means that $\frac{1}{n} \log \frac{a_n}{b_n} \to 0$ as $n \to \infty$.

Smallest probable set. Let $X_1, X_2, \ldots, X_n$ be i.i.d. ~ p(x), and for
$\delta < \frac{1}{2}$, let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set such that $\Pr\{B_\delta^{(n)}\} \geq 1 - \delta$.
Then

$$|B_\delta^{(n)}| \doteq 2^{nH}. \tag{3.30}$$

PROBLEMS 

3.1 Markov's inequality and Chebyshev's inequality

(a) (Markov's inequality) For any nonnegative random variable X
and any t > 0, show that

$$\Pr\{X \geq t\} \leq \frac{EX}{t}. \tag{3.31}$$

Exhibit a random variable that achieves this inequality with
equality.

(b) (Chebyshev's inequality) Let Y be a random variable with
mean $\mu$ and variance $\sigma^2$. By letting $X = (Y - \mu)^2$, show that
for any $\epsilon > 0$,

$$\Pr\{|Y - \mu| > \epsilon\} \leq \frac{\sigma^2}{\epsilon^2}. \tag{3.32}$$

(c) (Weak law of large numbers) Let $Z_1, Z_2, \ldots, Z_n$ be a sequence
of i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let
$\bar{Z}_n = \frac{1}{n} \sum_{i=1}^{n} Z_i$ be the sample mean. Show that

$$\Pr\{|\bar{Z}_n - \mu| > \epsilon\} \leq \frac{\sigma^2}{n \epsilon^2}. \tag{3.33}$$

Thus, $\Pr\{|\bar{Z}_n - \mu| > \epsilon\} \to 0$ as $n \to \infty$. This is known as the
weak law of large numbers.

3.2 AEP and mutual information. Let $(X_i, Y_i)$ be i.i.d. ~ p(x, y). We
form the log likelihood ratio of the hypothesis that X and Y are
independent vs. the hypothesis that X and Y are dependent. What
is the limit of

$$\frac{1}{n} \log \frac{p(X^n) p(Y^n)}{p(X^n, Y^n)}?$$

3.3 Piece of cake.
A cake is sliced roughly in half, the largest piece being chosen each
time, the other pieces discarded. We will assume that a random cut
creates pieces of proportions

$$P = \begin{cases} \left( \frac{2}{3}, \frac{1}{3} \right) & \text{with probability } \frac{3}{4} \\ \left( \frac{2}{5}, \frac{3}{5} \right) & \text{with probability } \frac{1}{4}. \end{cases}$$

Thus, for example, the first cut (and choice of largest piece) may
result in a piece of size $\frac{3}{5}$. Cutting and choosing from this piece
might reduce it to size $\left(\frac{3}{5}\right)\left(\frac{2}{3}\right)$ at time 2, and so on. How large, to
first order in the exponent, is the piece of cake after n cuts?

3.4 AEP. Let $X_i$ be i.i.d. ~ p(x), $x \in \{1, 2, \ldots, m\}$. Let $\mu = EX$ and
$H = -\sum p(x) \log p(x)$. Let $A^n = \{x^n \in \mathcal{X}^n : |-\frac{1}{n} \log p(x^n) - H| \leq \epsilon\}$.
Let $B^n = \{x^n \in \mathcal{X}^n : |\frac{1}{n} \sum_{i=1}^{n} x_i - \mu| \leq \epsilon\}$.

(a) Does $\Pr\{X^n \in A^n\} \to 1$?

(b) Does $\Pr\{X^n \in A^n \cap B^n\} \to 1$?

(c) Show that $|A^n \cap B^n| \leq 2^{n(H + \epsilon)}$ for all n.

(d) Show that $|A^n \cap B^n| \geq \left(\frac{1}{2}\right) 2^{n(H - \epsilon)}$ for n sufficiently large.

3.5 Sets defined by probabilities. Let $X_1, X_2, \ldots$ be an i.i.d. sequence
of discrete random variables with entropy $H(X)$. Let

$$C_n(t) = \{x^n \in \mathcal{X}^n : p(x^n) \geq 2^{-nt}\}$$

denote the subset of n-sequences with probabilities $\geq 2^{-nt}$.

(a) Show that $|C_n(t)| \leq 2^{nt}$.

(b) For what values of t does $P(\{X^n \in C_n(t)\}) \to 1$?

3.6 AEP-like limit. Let $X_1, X_2, \ldots$ be i.i.d. drawn according to
probability mass function p(x). Find

$$\lim_{n \to \infty} \left( p(X_1, X_2, \ldots, X_n) \right)^{1/n}.$$


3.7 AEP and source coding. A discrete memoryless source emits a
sequence of statistically independent binary digits with probabilities
p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time
and a binary codeword is provided for every sequence of 100 digits
containing three or fewer 1's.

(a) Assuming that all codewords are the same length, find the
minimum length required to provide codewords for all sequences
with three or fewer 1's.

(b) Calculate the probability of observing a source sequence for
which no codeword has been assigned.

(c) Use Chebyshev's inequality to bound the probability of
observing a source sequence for which no codeword has been
assigned. Compare this bound with the actual probability computed
in part (b).


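3.8 Products. Let

$$ X = \begin{cases} 1 & \text{with probability } \tfrac{1}{2}, \\ 2 & \text{with probability } \tfrac{1}{4}, \\ 3 & \text{with probability } \tfrac{1}{4}. \end{cases} $$

Let $X_1, X_2, \ldots$ be drawn i.i.d. according to this distribution. Find
the limiting behavior of the product

$$ (X_1 X_2 \cdots X_n)^{1/n}. $$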
3.9 AEP. Let $X_1, X_2, \ldots$ be independent, identically distributed
random variables drawn according to the probability mass function
$p(x)$, $x \in \{1, 2, \ldots, m\}$. Thus, $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i)$. We
know that $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability. Let
$q(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n q(x_i)$, where q is another probability
mass function on $\{1, 2, \ldots, m\}$.

(a) Evaluate $\lim -\frac{1}{n} \log q(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots$ are
i.i.d. $\sim p(x)$.

(b) Now evaluate the limit of the log likelihood ratio
$\frac{1}{n} \log \frac{q(X_1, \ldots, X_n)}{p(X_1, \ldots, X_n)}$ when $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$. Thus, the
odds favoring q are exponentially small when p is true.

3.10 Random box size.

An n-dimensional rectangular box with sides $X_1, X_2, X_3, \ldots, X_n$ is
to be constructed. The volume is $V_n = \prod_{i=1}^n X_i$. The edge length $l$
of an n-cube with the same volume as the random box is $l = V_n^{1/n}$.
Let $X_1, X_2, \ldots$ be i.i.d. uniform random variables over the unit
interval [0, 1]. Find $\lim_{n \to \infty} V_n^{1/n}$ and compare to $(EV_n)^{1/n}$. Clearly,
the expected edge length does not capture the idea of the volume
of the box. The geometric mean, rather than the arithmetic mean,
characterizes the behavior of products.

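3.11 Proof of Theorem 3.3.1. This problem shows that the size of the
smallest "probable" set is about $2^{nH}$. Let $X_1, X_2, \ldots, X_n$ be i.i.d.
$\sim p(x)$. Let $B_\delta^{(n)} \subset \mathcal{X}^n$ such that $\Pr(B_\delta^{(n)}) > 1 - \delta$. Fix $\epsilon < \tfrac{1}{2}$.

(a) Given any two sets A, B such that $\Pr(A) > 1 - \epsilon_1$ and $\Pr(B) >
1 - \epsilon_2$, show that $\Pr(A \cap B) > 1 - \epsilon_1 - \epsilon_2$. Hence,
$\Pr(A_\epsilon^{(n)} \cap B_\delta^{(n)}) \ge 1 - \epsilon - \delta$.

(b) Justify the steps in the chain of inequalities

\begin{align}
1 - \epsilon - \delta &\le \Pr(A_\epsilon^{(n)} \cap B_\delta^{(n)}) \tag{3.34} \\
&= \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} p(x^n) \tag{3.35} \\
&\le \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} 2^{-n(H-\epsilon)} \tag{3.36} \\
&= |A_\epsilon^{(n)} \cap B_\delta^{(n)}|\, 2^{-n(H-\epsilon)} \tag{3.37} \\
&\le |B_\delta^{(n)}|\, 2^{-n(H-\epsilon)}. \tag{3.38}
\end{align}

(c) Complete the proof of the theorem.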
3.12 Monotonic convergence of the empirical distribution.

Let $\hat{p}_n$ denote the empirical probability mass function corresponding
to $X_1, X_2, \ldots, X_n$ i.i.d. $\sim p(x)$, $x \in \mathcal{X}$. Specifically,

$$ \hat{p}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i = x) $$

is the proportion of times that $X_i = x$ in the first n samples, where
I is the indicator function.

(a) Show for $\mathcal{X}$ binary that

$$ E D(\hat{p}_{2n} \| p) \le E D(\hat{p}_n \| p). $$

Thus, the expected relative entropy "distance" from the empirical
distribution to the true distribution decreases with sample
size. (Hint: Write $\hat{p}_{2n} = \tfrac{1}{2} \hat{p}_n + \tfrac{1}{2} \hat{p}'_n$ and use the convexity
of D.)

(b) Show for an arbitrary discrete $\mathcal{X}$ that

$$ E D(\hat{p}_n \| p) \le E D(\hat{p}_{n-1} \| p). $$

(Hint: Write $\hat{p}_n$ as the average of n empirical mass functions
with each of the n samples deleted in turn.)

3.13 Calculation of typical set. To clarify the notion of a typical set
$A_\epsilon^{(n)}$ and the smallest set of high probability $B_\delta^{(n)}$, we will calculate
the set for a simple example. Consider a sequence of i.i.d. binary
random variables, $X_1, X_2, \ldots, X_n$, where the probability that $X_i =
1$ is 0.6 (and therefore the probability that $X_i = 0$ is 0.4).

(a) Calculate H(X).

(b) With n = 25 and $\epsilon = 0.1$, which sequences fall in the typical
set $A_\epsilon^{(n)}$? What is the probability of the typical set? How
many elements are there in the typical set? (This involves
computation of a table of probabilities for sequences with k 1's,
$0 \le k \le 25$, and finding those sequences that are in the typical
set.)

(c) How many elements are there in the smallest set that has
probability 0.9?

(d) How many elements are there in the intersection of the sets in
parts (b) and (c)? What is the probability of this intersection?
  k    \binom{n}{k}    \binom{n}{k} p^k (1-p)^{n-k}    -\frac{1}{n} \log p(x^n)

  0          1         0.000000                        1.321928
  1         25         0.000000                        1.298530
  2        300         0.000000                        1.275131
  3       2300         0.000001                        1.251733
  4      12650         0.000007                        1.228334
  5      53130         0.000054                        1.204936
  6     177100         0.000227                        1.181537
  7     480700         0.001205                        1.158139
  8    1081575         0.003121                        1.134740
  9    2042975         0.013169                        1.111342
 10    3268760         0.021222                        1.087943
 11    4457400         0.077801                        1.064545
 12    5200300         0.075967                        1.041146
 13    5200300         0.267718                        1.017748
 14    4457400         0.146507                        0.994349
 15    3268760         0.575383                        0.970951
 16    2042975         0.151086                        0.947552
 17    1081575         0.846448                        0.924154
 18     480700         0.079986                        0.900755
 19     177100         0.970638                        0.877357
 20      53130         0.019891                        0.853958
 21      12650         0.997633                        0.830560
 22       2300         0.001937                        0.807161
 23        300         0.999950                        0.783763
 24         25         0.000047                        0.760364
 25          1         0.000003                        0.736966
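
The columns of such a table can be regenerated directly from the definitions, which is also a useful check on the printed entries. The following is a minimal sketch, assuming logarithms to base 2 and the typicality condition $|-\frac{1}{n} \log p(x^n) - H| \le \epsilon$; the function name and the tuple layout are illustrative, not from the text.

```python
from math import comb, log2

def typical_set_summary(n=25, p=0.6, eps=0.1):
    """Return (H, rows, size, prob) for Bernoulli(p) sequences of length n.

    Each row is (k, count, prob_of_k_ones, sample_entropy, is_typical), where
    sample_entropy = -(1/n) log2 p(x^n) for any sequence containing k ones.
    size and prob are the cardinality and probability of the typical set.
    """
    H = -p * log2(p) - (1 - p) * log2(1 - p)
    rows, size, prob = [], 0, 0.0
    for k in range(n + 1):
        count = comb(n, k)                       # number of sequences with k ones
        seq_prob = p ** k * (1 - p) ** (n - k)   # probability of one such sequence
        sample_entropy = -log2(seq_prob) / n
        typical = abs(sample_entropy - H) <= eps
        rows.append((k, count, count * seq_prob, sample_entropy, typical))
        if typical:
            size += count
            prob += count * seq_prob
    return H, rows, size, prob

if __name__ == "__main__":
    H, rows, size, prob = typical_set_summary()
    for k, count, pk, se, typical in rows:
        print(f"{k:2d} {count:8d} {pk:.6f} {se:.6f} {'*' if typical else ''}")
    print("H =", H, "size =", size, "probability =", prob)
```

Because all sequences with the same number of 1's have the same probability, membership in the typical set depends only on k, so the table needs only n + 1 rows.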


HISTORICAL NOTES 


The asymptotic equipartition property (AEP) was first stated by Shannon
in his original 1948 paper [472], where he proved the result for
i.i.d. processes and stated the result for stationary ergodic processes.
McMillan [384] and Breiman [74] proved the AEP for ergodic finite
alphabet sources. The result is now referred to as the AEP or the
Shannon-McMillan-Breiman theorem. Chung [101] extended the theorem to
the case of countable alphabets and Moy [392], Perez [417], and Kieffer
[312] proved the $L_1$ convergence when $\{X_i\}$ is continuous valued and
ergodic. Barron [34] and Orey [402] proved almost sure convergence for
real-valued ergodic processes; a simple sandwich argument (Algoet and
Cover [20]) will be used in Section 16.8 to prove the general AEP.






CHAPTER 4 

ENTROPY RATES 

OF A STOCHASTIC PROCESS 


The asymptotic equipartition property in Chapter 3 establishes that
$nH(X)$ bits suffice on the average to describe n independent and
identically distributed random variables. But what if the random variables
are dependent? In particular, what if the random variables form a
stationary process? We will show, just as in the i.i.d. case, that the entropy
$H(X_1, X_2, \ldots, X_n)$ grows (asymptotically) linearly with n at a rate $H(\mathcal{X})$,
which we will call the entropy rate of the process. The interpretation of
$H(\mathcal{X})$ as the best achievable data compression will await the analysis in
Chapter 5.


4.1 MARKOV CHAINS 


A stochastic process $\{X_i\}$ is an indexed sequence of random variables.
In general, there can be an arbitrary dependence among the random
variables. The process is characterized by the joint probability mass functions
$\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\} = p(x_1, x_2, \ldots, x_n)$,
$(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ for $n = 1, 2, \ldots$.

Definition A stochastic process is said to be stationary if the joint
distribution of any subset of the sequence of random variables is invariant
with respect to shifts in the time index; that is,

$$ \Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\} \tag{4.1} $$

for every n and every shift $l$ and for all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.
A simple example of a stochastic process with dependence is one in
which each random variable depends only on the one preceding it and 
is conditionally independent of all the other preceding random variables. 
Such a process is said to be Markov. 

Definition A discrete stochastic process $X_1, X_2, \ldots$ is said to be a
Markov chain or a Markov process if for $n = 1, 2, \ldots$,

$$ \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n) \tag{4.2} $$

for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.

In this case, the joint probability mass function of the random variables
can be written as

$$ p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2|x_1)\, p(x_3|x_2) \cdots p(x_n|x_{n-1}). \tag{4.3} $$

Definition The Markov chain is said to be time invariant if the
conditional probability $p(x_{n+1}|x_n)$ does not depend on n; that is, for
$n = 1, 2, \ldots$,

$$ \Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\} \quad \text{for all } a, b \in \mathcal{X}. \tag{4.4} $$

We will assume that the Markov chain is time invariant unless otherwise 
stated. 

If $\{X_i\}$ is a Markov chain, $X_n$ is called the state at time n. A
time-invariant Markov chain is characterized by its initial state and a probability
transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr\{X_{n+1} =
j \mid X_n = i\}$.

If it is possible to go with positive probability from any state of the
Markov chain to any other state in a finite number of steps, the Markov
chain is said to be irreducible. If the largest common factor of the lengths
of different paths from a state to itself is 1, the Markov chain is said to
be aperiodic.

If the probability mass function of the random variable at time n is
$p(x_n)$, the probability mass function at time n + 1 is

$$ p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}. \tag{4.5} $$


A distribution on the states such that the distribution at time n + 1 is the
same as the distribution at time n is called a stationary distribution. The
stationary distribution is so called because if the initial state of a Markov
chain is drawn according to a stationary distribution, the Markov chain
forms a stationary process.

If the finite-state Markov chain is irreducible and aperiodic, the
stationary distribution is unique, and from any starting distribution, the
distribution of $X_n$ tends to the stationary distribution as $n \to \infty$.
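
The update (4.5) and the convergence claim can be illustrated numerically. The following is a minimal sketch; the 3-state transition matrix `P` is a hypothetical example (all entries positive, hence irreducible and aperiodic), and the function names are illustrative.

```python
def step(p, P):
    """One application of (4.5): p_{n+1}(j) = sum_i p_n(i) * P[i][j]."""
    m = len(P)
    return [sum(p[i] * P[i][j] for i in range(m)) for j in range(m)]

def evolve(p0, P, steps):
    """Distribution of the state after the given number of steps from p0."""
    p = list(p0)
    for _ in range(steps):
        p = step(p, P)
    return p

# Hypothetical 3-state chain; each row is a conditional distribution.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

if __name__ == "__main__":
    # Two very different starting distributions end up at the same place.
    print(evolve([1.0, 0.0, 0.0], P, 200))
    print(evolve([0.0, 0.0, 1.0], P, 200))
```

The common limit is left unchanged by a further application of `step`, which is exactly the defining property of a stationary distribution.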

Example 4.1.1 Consider a two-state Markov chain with a probability
transition matrix

$$ P = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix} \tag{4.6} $$

as shown in Figure 4.1.

Let the stationary distribution be represented by a vector $\mu$ whose
components are the stationary probabilities of states 1 and 2, respectively. Then
the stationary probability can be found by solving the equation $\mu P = \mu$
or, more simply, by balancing probabilities. For the stationary distribution,
the net probability flow across any cut set in the state transition graph is
zero. Applying this to Figure 4.1, we obtain

$$ \mu_1 \alpha = \mu_2 \beta. \tag{4.7} $$

Since $\mu_1 + \mu_2 = 1$, the stationary distribution is

$$ \mu_1 = \frac{\beta}{\alpha + \beta}, \qquad \mu_2 = \frac{\alpha}{\alpha + \beta}. \tag{4.8} $$

FIGURE 4.1. Two-state Markov chain. [State 1 moves to State 2 with
probability $\alpha$ and stays with probability $1 - \alpha$; State 2 moves to
State 1 with probability $\beta$ and stays with probability $1 - \beta$.]

If the Markov chain has an initial state drawn according to the stationary
distribution, the resulting process will be stationary. The entropy of the
state $X_n$ at time n is

$$ H(X_n) = H\left(\frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta}\right). \tag{4.9} $$

However, this is not the rate at which entropy grows for $H(X_1, X_2, \ldots,
X_n)$. The dependence among the $X_i$'s will take a steady toll.
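
The closed forms for the two-state chain are easy to check numerically. The following is a minimal sketch, assuming $\alpha + \beta > 0$ and logarithms to base 2; the function names and the sample values of $\alpha$ and $\beta$ are illustrative.

```python
from math import log2

def binary_entropy(p):
    """H(p, 1 - p) in bits, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def two_state_stationary(alpha, beta):
    """Stationary distribution (mu_1, mu_2) of the two-state chain, as in (4.8)."""
    s = alpha + beta            # assumed to be positive
    return beta / s, alpha / s

def state_entropy(alpha, beta):
    """Entropy H(X_n) of the state under the stationary distribution, as in (4.9)."""
    mu1, _ = two_state_stationary(alpha, beta)
    return binary_entropy(mu1)

if __name__ == "__main__":
    alpha, beta = 0.2, 0.3      # hypothetical transition probabilities
    mu1, mu2 = two_state_stationary(alpha, beta)
    print(mu1, mu2, mu1 * alpha, mu2 * beta, state_entropy(alpha, beta))
```

The printed products `mu1 * alpha` and `mu2 * beta` agree, which is the balance condition used to derive the stationary distribution.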


4.2 ENTROPY RATE 


If we have a sequence of n random variables, a natural question to ask
is: How does the entropy of the sequence grow with n? We define the
entropy rate as this rate of growth as follows.


Definition The entropy rate of a stochastic process $\{X_i\}$ is defined by

$$ H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) \tag{4.10} $$

when the limit exists.


We now consider some simple examples of stochastic processes and 
their corresponding entropy rates. 


1. Typewriter.

Consider the case of a typewriter that has m equally likely output
letters. The typewriter can produce $m^n$ sequences of length n, all
of them equally likely. Hence $H(X_1, X_2, \ldots, X_n) = \log m^n$ and the
entropy rate is $H(\mathcal{X}) = \log m$ bits per symbol.

2. $X_1, X_2, \ldots$ are i.i.d. random variables. Then

$$ H(\mathcal{X}) = \lim \frac{H(X_1, X_2, \ldots, X_n)}{n} = \lim \frac{n H(X_1)}{n} = H(X_1), \tag{4.11} $$

which is what one would expect for the entropy rate per symbol.

3. Sequence of independent but not identically distributed random
variables. In this case,

$$ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i), \tag{4.12} $$

but the $H(X_i)$'s are not all equal. We can choose a sequence of
distributions on $X_1, X_2, \ldots$ such that the limit of $\frac{1}{n} \sum H(X_i)$ does not
exist. An example of such a sequence is a random binary sequence
where $p_i = P(X_i = 1)$ is not constant but a function of i, chosen
carefully so that the limit in (4.10) does not exist. For example, let

$$ p_i = \begin{cases} 0.5 & \text{if } 2k < \log \log i \le 2k + 1, \\ 0 & \text{if } 2k + 1 < \log \log i \le 2k + 2 \end{cases} \tag{4.13} $$

for $k = 0, 1, 2, \ldots$.

Then there are arbitrarily long stretches where $H(X_i) = 1$, followed
by exponentially longer segments where $H(X_i) = 0$. Hence, the
running average of the $H(X_i)$ will oscillate between 0 and 1 and will
not have a limit. Thus, $H(\mathcal{X})$ is not defined for this process.
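
The oscillation in the third example can be seen with a short computation. The following is a minimal sketch, assuming logarithms to base 2 and the convention that $p_i = 0.5$ when $\log \log i$ lies in $(2k, 2k+1]$; indices $i \le 2$ fall outside every interval and are treated here as $p_i = 0$, which is an assumption of the sketch. Integer arithmetic avoids rounding at the interval boundaries.

```python
def p_i(i):
    """p_i of (4.13), using base-2 logs and intervals open on the left, closed on the right."""
    if i <= 2:
        return 0.0                      # assumption: outside every interval
    j = 0
    while (1 << (1 << j)) < i:          # smallest j with 2^(2^j) >= i
        j += 1
    # j = ceil(log2 log2 i); odd j means log log i lies in (2k, 2k + 1]
    return 0.5 if j % 2 == 1 else 0.0

def h_i(i):
    """H(X_i) in bits: 1 for a fair bit, 0 for a deterministic one."""
    return 1.0 if p_i(i) == 0.5 else 0.0

def running_average(n):
    """(1/n) * sum_{i=1}^{n} H(X_i)."""
    return sum(h_i(i) for i in range(1, n + 1)) / n

if __name__ == "__main__":
    for n in (4, 16, 256, 65536):
        print(n, running_average(n))
```

Evaluated at the interval boundaries, the running average swings between values near 1 and values near 0, so it cannot converge.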


We can also define a related quantity for entropy rate:

$$ H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1) \tag{4.14} $$

when the limit exists.

The two quantities $H(\mathcal{X})$ and $H'(\mathcal{X})$ correspond to two different notions
of entropy rate. The first is the per symbol entropy of the n random
variables, and the second is the conditional entropy of the last random variable
given the past. We now prove the important result that for stationary
processes both limits exist and are equal.

Theorem 4.2.1 For a stationary stochastic process, the limits in (4.10)
and (4.14) exist and are equal:

$$ H(\mathcal{X}) = H'(\mathcal{X}). \tag{4.15} $$

We first prove that $\lim H(X_n \mid X_{n-1}, \ldots, X_1)$ exists.

Theorem 4.2.2 For a stationary stochastic process, $H(X_n \mid X_{n-1}, \ldots,
X_1)$ is nonincreasing in n and has a limit $H'(\mathcal{X})$.

Proof

\begin{align}
H(X_{n+1} \mid X_1, X_2, \ldots, X_n) &\le H(X_{n+1} \mid X_n, \ldots, X_2) \tag{4.16} \\
&= H(X_n \mid X_{n-1}, \ldots, X_1), \tag{4.17}
\end{align}

where the inequality follows from the fact that conditioning reduces
entropy and the equality follows from the stationarity of the process. Since
$H(X_n \mid X_{n-1}, \ldots, X_1)$ is a nonincreasing sequence of nonnegative numbers,
it has a limit, $H'(\mathcal{X})$. □
We now use the following simple result from analysis.

Theorem 4.2.3 (Cesàro mean) If $a_n \to a$ and $b_n = \frac{1}{n} \sum_{i=1}^n a_i$, then
$b_n \to a$.


Proof: (Informal outline). Since most of the terms in the sequence $\{a_k\}$
are eventually close to a, then $b_n$, which is the average of the first n terms,
is also eventually close to a.


Formal Proof: Let $\epsilon > 0$. Since $a_n \to a$, there exists a number $N(\epsilon)$
such that $|a_n - a| \le \epsilon$ for all $n \ge N(\epsilon)$. Hence,

\begin{align}
|b_n - a| &= \left| \frac{1}{n} \sum_{i=1}^n (a_i - a) \right| \tag{4.18} \\
&\le \frac{1}{n} \sum_{i=1}^n |a_i - a| \tag{4.19} \\
&\le \frac{1}{n} \sum_{i=1}^{N(\epsilon)} |a_i - a| + \frac{n - N(\epsilon)}{n}\, \epsilon \tag{4.20} \\
&\le \frac{1}{n} \sum_{i=1}^{N(\epsilon)} |a_i - a| + \epsilon \tag{4.21}
\end{align}

for all $n \ge N(\epsilon)$. Since the first term goes to 0 as $n \to \infty$, we can make
$|b_n - a| \le 2\epsilon$ by taking n large enough. Hence, $b_n \to a$ as $n \to \infty$. □

Proof of Theorem 4.2.1: By the chain rule,

$$ \frac{H(X_1, X_2, \ldots, X_n)}{n} = \frac{1}{n} \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1), \tag{4.22} $$

that is, the entropy rate is the time average of the conditional entropies.
But we know that the conditional entropies tend to a limit $H'$. Hence, by
Theorem 4.2.3, their running average has a limit, which is equal to the
limit $H'$ of the terms. Thus, by Theorem 4.2.2,

$$ H(\mathcal{X}) = \lim \frac{H(X_1, X_2, \ldots, X_n)}{n} = \lim H(X_n \mid X_{n-1}, \ldots, X_1) = H'(\mathcal{X}). \quad \Box \tag{4.23} $$
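
The argument above can be traced numerically on a small case. The following is a minimal sketch, assuming the stationary two-state chain of Example 4.1.1 with hypothetical values of $\alpha$ and $\beta$; it builds the conditional entropies $a_i = H(X_i \mid X_{i-1}, \ldots, X_1)$ and their Cesàro means $b_n = H(X_1, \ldots, X_n)/n$. The function names are illustrative.

```python
from itertools import accumulate
from math import log2

def H(dist):
    """Entropy in bits of a probability vector, with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

def cesaro_means(a):
    """b_n = (1/n) * sum_{i<=n} a_i for n = 1, ..., len(a)."""
    return [s / n for n, s in enumerate(accumulate(a), start=1)]

# Hypothetical parameters for the two-state chain.
alpha, beta = 0.2, 0.3
mu = (beta / (alpha + beta), alpha / (alpha + beta))
# H(X_2 | X_1) under the stationary distribution.
h_cond = mu[0] * H((1 - alpha, alpha)) + mu[1] * H((beta, 1 - beta))

def conditional_entropies(n):
    """a_1 = H(X_1); for i >= 2 the Markov property and stationarity
    give H(X_i | X_{i-1}, ..., X_1) = H(X_2 | X_1)."""
    return [H(mu)] + [h_cond] * (n - 1)

def per_symbol_entropies(n):
    """b_k = H(X_1, ..., X_k) / k for k = 1..n, via the chain rule (4.22)."""
    return cesaro_means(conditional_entropies(n))

if __name__ == "__main__":
    b = per_symbol_entropies(1000)
    print(b[0], b[9], b[99], b[999], h_cond)
```

The per-symbol entropies start at $H(X_1)$ and decrease toward the limit of the conditional entropies, as the Cesàro mean theorem predicts.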
The significance of the entropy rate of a stochastic process arises from
the AEP for a stationary ergodic process. We prove the general AEP in
Section 16.8, where we show that for any stationary ergodic process,

$$ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(\mathcal{X}) \tag{4.24} $$

with probability 1. Using this, the theorems of Chapter 3 can easily be
extended to a general stationary ergodic process. We can define a typical
set in the same way as we did for the i.i.d. case in Chapter 3. By the
same arguments, we can show that the typical set has a probability close
to 1 and that there are about $2^{nH(\mathcal{X})}$ typical sequences of length n, each
with probability about $2^{-nH(\mathcal{X})}$. We can therefore represent the typical
sequences of length n using approximately $nH(\mathcal{X})$ bits. This shows the
significance of the entropy rate as the average description length for a
stationary ergodic process.

The entropy rate is well defined for all stationary processes. The entropy 
rate is particularly easy to calculate for Markov chains. 

Markov Chains. For a stationary Markov chain, the entropy rate is
given by

$$ H(\mathcal{X}) = H'(\mathcal{X}) = \lim H(X_n \mid X_{n-1}, \ldots, X_1) = \lim H(X_n \mid X_{n-1}) = H(X_2 \mid X_1), \tag{4.25} $$

where the conditional entropy is calculated using the given stationary
distribution. Recall that the stationary distribution $\mu$ is the solution of the
equations

$$ \mu_j = \sum_i \mu_i P_{ij} \quad \text{for all } j. \tag{4.26} $$

We express the conditional entropy explicitly in the following theorem.

Theorem 4.2.4 Let $\{X_i\}$ be a stationary Markov chain with
stationary distribution $\mu$ and transition matrix P. Let $X_1 \sim \mu$. Then the entropy
rate is

$$ H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}. \tag{4.27} $$

Proof: $H(\mathcal{X}) = H(X_2 \mid X_1) = \sum_i \mu_i \left( \sum_j -P_{ij} \log P_{ij} \right)$. □
Example 4.2.1 (Two-state Markov chain) The entropy rate of the
two-state Markov chain in Figure 4.1 is

$$ H(\mathcal{X}) = H(X_2 \mid X_1) = \frac{\beta}{\alpha + \beta} H(\alpha) + \frac{\alpha}{\alpha + \beta} H(\beta). \tag{4.28} $$

Remark If the Markov chain is irreducible and aperiodic, it has a unique
stationary distribution on the states, and any initial distribution tends to
the stationary distribution as $n \to \infty$. In this case, even though the initial
distribution is not the stationary distribution, the entropy rate, which is
defined in terms of long-term behavior, is $H(\mathcal{X})$, as defined in (4.25) and
(4.27).
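
The formula of Theorem 4.2.4 translates directly into a short routine. The following is a minimal sketch, assuming a finite irreducible aperiodic chain (so that repeated multiplication by P converges to the stationary distribution) and logarithms to base 2; the function names and the iteration count are illustrative choices.

```python
from math import log2

def stationary(P, iters=10_000):
    """Approximate the stationary distribution by iterating mu <- mu P.
    Assumes the chain is irreducible and aperiodic so the iteration converges."""
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
    return mu

def entropy_rate(P):
    """Entropy rate -sum_{i,j} mu_i P_ij log2 P_ij of (4.27), in bits per symbol."""
    mu = stationary(P)
    return -sum(mu[i] * p * log2(p)
                for i, row in enumerate(P) for p in row if p > 0)

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 = 0."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def two_state_rate(alpha, beta):
    """Closed form (4.28) for the two-state chain."""
    s = alpha + beta
    return beta / s * binary_entropy(alpha) + alpha / s * binary_entropy(beta)

if __name__ == "__main__":
    alpha, beta = 0.2, 0.3      # hypothetical transition probabilities
    P = [[1 - alpha, alpha], [beta, 1 - beta]]
    print(entropy_rate(P), two_state_rate(alpha, beta))
```

For larger state spaces one would usually solve the linear system $\mu P = \mu$ directly rather than iterate; the iteration is used here only to keep the sketch free of dependencies.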

4.3 EXAMPLE: ENTROPY RATE OF A RANDOM WALK 
ON A WEIGHTED GRAPH 

As an example of a stochastic process, let us consider a random walk on
a connected graph (Figure 4.2). Consider a graph with m nodes labeled
$\{1, 2, \ldots, m\}$, with weight $W_{ij} \ge 0$ on the edge joining node i to node
j. (The graph is assumed to be undirected, so that $W_{ij} = W_{ji}$. We set
$W_{ij} = 0$ if there is no edge joining nodes i and j.)

A particle walks randomly from node to node in this graph. The
random walk $\{X_n\}$, $X_n \in \{1, 2, \ldots, m\}$, is a sequence of vertices of the
graph. Given $X_n = i$, the next vertex j is chosen from among the nodes
connected to node i with a probability proportional to the weight of the
edge connecting i to j. Thus, $P_{ij} = W_{ij} / \sum_k W_{ik}$.

FIGURE 4.2. Random walk on a graph.
In this case, the stationary distribution has a surprisingly simple form,
which we will guess and verify. The stationary distribution for this Markov
chain assigns probability to node i proportional to the total weight of the
edges emanating from node i. Let

$$ W_i = \sum_j W_{ij} \tag{4.29} $$

be the total weight of edges emanating from node i, and let

$$ W = \sum_{i, j :\, j > i} W_{ij} \tag{4.30} $$

be the sum of the weights of all the edges. Then $\sum_i W_i = 2W$.
We now guess that the stationary distribution is

$$ \mu_i = \frac{W_i}{2W}. \tag{4.31} $$

We verify that this is the stationary distribution by checking that $\mu P = \mu$.
Here

\begin{align}
\sum_i \mu_i P_{ij} &= \sum_i \frac{W_i}{2W} \frac{W_{ij}}{W_i} \tag{4.32} \\
&= \sum_i \frac{1}{2W} W_{ij} \tag{4.33} \\
&= \frac{W_j}{2W} \tag{4.34} \\
&= \mu_j. \tag{4.35}
\end{align}

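Thus, the stationary probability of state i is proportional to the weight of
edges emanating from node i. This stationary distribution has an
interesting property of locality: It depends only on the total weight and the
weight of edges connected to the node and hence does not change if the
weights in some other part of the graph are changed while keeping the
total weight constant. We can now calculate the entropy rate as

\begin{align}
H(\mathcal{X}) &= H(X_2 \mid X_1) \tag{4.36} \\
&= -\sum_i \mu_i \sum_j P_{ij} \log P_{ij} \tag{4.37} \\
&= -\sum_i \frac{W_i}{2W} \sum_j \frac{W_{ij}}{W_i} \log \frac{W_{ij}}{W_i} \tag{4.38} \\
&= -\sum_i \sum_j \frac{W_{ij}}{2W} \log \frac{W_{ij}}{W_i} \tag{4.39} \\
&= -\sum_i \sum_j \frac{W_{ij}}{2W} \log \frac{W_{ij}}{2W} + \sum_i \sum_j \frac{W_{ij}}{2W} \log \frac{W_i}{2W} \tag{4.40} \\
&= H\left(\ldots, \frac{W_{ij}}{2W}, \ldots\right) - H\left(\ldots, \frac{W_i}{2W}, \ldots\right). \tag{4.41}
\end{align}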
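
The guessed stationary distribution and the two expressions for the entropy rate can be checked on a small graph. The following is a minimal sketch, assuming a symmetric weight matrix in which every node has at least one edge; the 4-node matrix `W` is a hypothetical example and the function names are illustrative.

```python
from math import log2

def H(dist):
    """Entropy in bits of a probability vector, with 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

def random_walk(W):
    """Return (P, mu) for the random walk on the weighted graph W:
    P_ij = W_ij / W_i and mu_i = W_i / 2W, as in (4.31)."""
    m = len(W)
    Wi = [sum(row) for row in W]        # total weight at each node, as in (4.29)
    total = sum(Wi)                     # equals 2W
    P = [[W[i][j] / Wi[i] for j in range(m)] for i in range(m)]
    mu = [w / total for w in Wi]
    return P, mu

def entropy_rate_direct(W):
    """sum_i mu_i H(P_i.), the average transition entropy of (4.37)."""
    P, mu = random_walk(W)
    return sum(mu[i] * H(P[i]) for i in range(len(W)))

def entropy_rate_formula(W):
    """H(..., W_ij/2W, ...) - H(..., W_i/2W, ...), as in (4.41)."""
    total = sum(sum(row) for row in W)
    edge_dist = [w / total for row in W for w in row]
    node_dist = [sum(row) / total for row in W]
    return H(edge_dist) - H(node_dist)

# Hypothetical undirected graph: symmetric weights, zero diagonal.
W = [[0, 1, 2, 0],
     [1, 0, 1, 3],
     [2, 1, 0, 1],
     [0, 3, 1, 0]]

if __name__ == "__main__":
    print(random_walk(W)[1], entropy_rate_direct(W), entropy_rate_formula(W))
```

The two routines agree up to rounding, which mirrors the algebra leading from the average transition entropy to the difference of two entropies.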


If all the edges have equal weight, the stationary distribution puts
weight $E_i / 2E$ on node i, where $E_i$ is the number of edges emanating
from node i and E is the total number of edges in the graph. In this case,
the entropy rate of the random walk is

$$ H(\mathcal{X}) = \log(2E) - H\left(\frac{E_1}{2E}, \frac{E_2}{2E}, \ldots, \frac{E_m}{2E}\right). \tag{4.42} $$

This answer for the entropy rate is so simple that it is almost
misleading. Apparently, the entropy rate, which is the average transition entropy,
depends only on the entropy of the stationary distribution and the total
number of edges.

Example 4.3.1 (Random walk on a chessboard) Let a king move at
random on an 8 x 8 chessboard. The king has eight moves in the interior,
five moves at the edges, and three moves at the corners. Using this and
the preceding results, the stationary probabilities are, respectively, $\tfrac{8}{420}$,
$\tfrac{5}{420}$, and $\tfrac{3}{420}$, and the entropy rate is 0.92 log 8. The factor of 0.92 is due
to edge effects; we would have an entropy rate of log 8 on an infinite
chessboard.

Similarly, we can find the entropy rate of rooks (log 14 bits, since the
rook always has 14 possible moves), bishops, and queens. The queen
combines the moves of a rook and a bishop. Does the queen have more
or less freedom than the pair?
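
The king's entropy rate can be reproduced by counting moves square by square. The following is a minimal sketch, assuming an unweighted graph in which the king chooses uniformly among its legal moves and logarithms to base 2; the function names are illustrative.

```python
from math import log2

# The eight king moves as (row, column) offsets.
KING = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def king_degrees(size=8):
    """Number of legal king moves from each square of a size x size board."""
    degs = []
    for r in range(size):
        for c in range(size):
            degs.append(sum(0 <= r + dr < size and 0 <= c + dc < size
                            for dr, dc in KING))
    return degs

def entropy_rate_from_degrees(degs):
    """Unweighted random walk: mu_i = E_i / 2E and each move from node i is
    uniform over E_i neighbors, so the rate is sum_i mu_i log2 E_i."""
    two_E = sum(degs)
    return sum(d / two_E * log2(d) for d in degs)

def entropy_rate_via_formula(degs):
    """log2(2E) - H(E_1/2E, ..., E_m/2E), the equal-weight special case."""
    two_E = sum(degs)
    h_nodes = -sum(d / two_E * log2(d / two_E) for d in degs)
    return log2(two_E) - h_nodes

if __name__ == "__main__":
    degs = king_degrees()
    rate = entropy_rate_from_degrees(degs)
    print(sum(degs), rate, rate / log2(8))
    print(entropy_rate_from_degrees([14] * 64), log2(14))   # rook
```

The same two functions apply to any piece once its move counts per square are known; for the rook every square has 14 moves, so the rate is exactly log 14.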


Remark It is easy to see that a stationary random walk on a graph is
time-reversible; that is, the probability of any sequence of states is the
same forward or backward:

$$ \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_n = x_1, X_{n-1} = x_2, \ldots, X_1 = x_n). \tag{4.43} $$

Rather surprisingly, the converse is also true; that is, any time-reversible
Markov chain can be represented as a random walk on an undirected
weighted graph.

4.4 SECOND LAW OF THERMODYNAMICS 

One of the basic laws of physics, the second law of thermodynamics, 
states that the entropy of an isolated system is nondecreasing. We now 
explore the relationship between the second law and the entropy function 
that we defined earlier in this chapter. 

In statistical thermodynamics, entropy is often defined as the log of 
the number of microstates in the system. This corresponds exactly to our 
notion of entropy if all the states are equally likely. But why does entropy 
increase? 

We model the isolated system as a Markov chain with transitions obey¬ 
ing the physical laws governing the system. Implicit in this assumption is 
the notion of an overall state of the system and the fact that knowing the 
present state, the future of the system is independent of the past. In such 
a system we can find four different interpretations of the second law. It 
may come as a shock to find that the entropy does not always increase. 
However, relative entropy always decreases. 

1. Relative entropy $D(\mu_n \| \mu'_n)$ decreases with n. Let $\mu_n$ and $\mu'_n$ be two
probability distributions on the state space of a Markov chain at time
n, and let $\mu_{n+1}$ and $\mu'_{n+1}$ be the corresponding distributions at time
n + 1. Let the corresponding joint mass functions be denoted by
p and q. Thus, $p(x_n, x_{n+1}) = p(x_n) r(x_{n+1}|x_n)$ and $q(x_n, x_{n+1}) =
q(x_n) r(x_{n+1}|x_n)$, where $r(\cdot|\cdot)$ is the probability transition function
for the Markov chain. Then by the chain rule for relative entropy,
we have two expansions:

\begin{align*}
D(p(x_n, x_{n+1}) \| q(x_n, x_{n+1})) &= D(p(x_n) \| q(x_n)) + D(p(x_{n+1}|x_n) \| q(x_{n+1}|x_n)) \\
&= D(p(x_{n+1}) \| q(x_{n+1})) + D(p(x_n|x_{n+1}) \| q(x_n|x_{n+1})).
\end{align*}

Since both p and q are derived from the Markov chain, the
conditional probability mass functions $p(x_{n+1}|x_n)$ and $q(x_{n+1}|x_n)$ are
both equal to $r(x_{n+1}|x_n)$, and hence $D(p(x_{n+1}|x_n) \| q(x_{n+1}|x_n)) = 0$.
Now using the nonnegativity of $D(p(x_n|x_{n+1}) \| q(x_n|x_{n+1}))$
(Corollary to Theorem 2.6.3), we have

$$ D(p(x_n) \| q(x_n)) \ge D(p(x_{n+1}) \| q(x_{n+1})) \tag{4.44} $$

or

$$ D(\mu_n \| \mu'_n) \ge D(\mu_{n+1} \| \mu'_{n+1}). \tag{4.45} $$

Consequently, the distance between the probability mass functions
is decreasing with time n for any Markov chain.

An example of one interpretation of the preceding inequality is
to suppose that the tax system for the redistribution of wealth is
the same in Canada and in England. Then if $\mu_n$ and $\mu'_n$ represent
the distributions of wealth among people in the two countries, this
inequality shows that the relative entropy distance between the two
distributions decreases with time. The wealth distributions in Canada
and England become more similar.

2. Relative entropy $D(\mu_n \| \mu)$ between a distribution $\mu_n$ on the states at
time n and a stationary distribution $\mu$ decreases with n. In (4.45),
$\mu'_n$ is any distribution on the states at time n. If we let $\mu'_n$ be any
stationary distribution $\mu$, the distribution $\mu'_{n+1}$ at the next time is
also equal to $\mu$. Hence,

$$ D(\mu_n \| \mu) \ge D(\mu_{n+1} \| \mu), \tag{4.46} $$

which implies that any state distribution gets closer and closer to
each stationary distribution as time passes. The sequence $D(\mu_n \| \mu)$
is a monotonically nonincreasing nonnegative sequence and must
therefore have a limit. The limit is zero if the stationary distribution
is unique, but this is more difficult to prove.

3. Entropy increases if the stationary distribution is uniform. In
general, the fact that the relative entropy decreases does not imply that
the entropy increases. A simple counterexample is provided by any
Markov chain with a nonuniform stationary distribution. If we start
this Markov chain from the uniform distribution, which already is
the maximum entropy distribution, the distribution will tend to the
stationary distribution, which has a lower entropy than the uniform.
Here, the entropy decreases with time.

If, however, the stationary distribution is the uniform distribution,
we can express the relative entropy as

$$ D(\mu_n \| \mu) = \log |\mathcal{X}| - H(\mu_n) = \log |\mathcal{X}| - H(X_n). \tag{4.47} $$

In this case the monotonic decrease in relative entropy implies a
monotonic increase in entropy. This is the explanation that ties in
most closely with statistical thermodynamics, where all the
microstates are equally likely. We now characterize processes having a
uniform stationary distribution.

Definition A probability transition matrix $[P_{ij}]$, $P_{ij} = \Pr\{X_{n+1} =
j \mid X_n = i\}$, is called doubly stochastic if

$$ \sum_i P_{ij} = 1, \quad j = 1, 2, \ldots \tag{4.48} $$

and

$$ \sum_j P_{ij} = 1, \quad i = 1, 2, \ldots. \tag{4.49} $$

Remark The uniform distribution is a stationary distribution of P if
and only if the probability transition matrix is doubly stochastic (see
Problem 4.1).

4. The conditional entropy $H(X_n|X_1)$ increases with n for a
stationary Markov process. If the Markov process is stationary, $H(X_n)$ is
constant. So the entropy is nonincreasing. However, we will prove
that $H(X_n|X_1)$ increases with n. Thus, the conditional uncertainty
of the future increases. We give two alternative proofs of this result.
First, we use the properties of entropy,

\begin{align}
H(X_n|X_1) &\ge H(X_n|X_1, X_2) \quad \text{(conditioning reduces entropy)} \tag{4.50} \\
&= H(X_n|X_2) \quad \text{(by Markovity)} \tag{4.51} \\
&= H(X_{n-1}|X_1) \quad \text{(by stationarity)}. \tag{4.52}
\end{align}

Alternatively, by an application of the data-processing inequality to
the Markov chain $X_1 \to X_{n-1} \to X_n$, we have

$$ I(X_1; X_{n-1}) \ge I(X_1; X_n). \tag{4.53} $$

Expanding the mutual informations in terms of entropies, we have

$$ H(X_{n-1}) - H(X_{n-1}|X_1) \ge H(X_n) - H(X_n|X_1). \tag{4.54} $$

By stationarity, $H(X_{n-1}) = H(X_n)$, and hence we have

$$ H(X_{n-1}|X_1) \le H(X_n|X_1). \tag{4.55} $$

[These techniques can also be used to show that $H(X_0|X_n)$ is
increasing in n for any Markov chain.]

5. Shuffles increase entropy. If T is a shuffle (permutation) of a deck
of cards and X is the initial (random) position of the cards in the
deck, and if the choice of the shuffle T is independent of X, then

$$ H(TX) \ge H(X), \tag{4.56} $$

where TX is the permutation of the deck induced by the shuffle T
on the initial permutation X. Problem 4.3 outlines a proof.
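
The first three interpretations can be observed numerically on small chains. The following is a minimal sketch; both transition matrices are hypothetical examples (`P` has a nonuniform stationary distribution, `Q` is doubly stochastic), and the function names are illustrative.

```python
from math import log2

def step(p, P):
    """Distribution at the next time: p'(j) = sum_i p(i) P[i][j]."""
    m = len(P)
    return [sum(p[i] * P[i][j] for i in range(m)) for j in range(m)]

def D(p, q):
    """Relative entropy D(p||q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(p):
    """Entropy in bits of a probability vector."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def relative_entropy_path(p, q, P, steps):
    """D(mu_n || mu'_n) for n = 0..steps when both distributions evolve under P."""
    out = []
    for _ in range(steps + 1):
        out.append(D(p, q))
        p, q = step(p, P), step(q, P)
    return out

def entropy_path(p, P, steps):
    """H(X_n) for n = 0..steps when the state distribution evolves under P."""
    out = []
    for _ in range(steps + 1):
        out.append(H(p))
        p = step(p, P)
    return out

# Hypothetical chain with a nonuniform stationary distribution.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
# Hypothetical doubly stochastic chain (rows and columns sum to 1).
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]

if __name__ == "__main__":
    print(relative_entropy_path([0.9, 0.05, 0.05], [0.1, 0.2, 0.7], P, 5))
    print(entropy_path([0.9, 0.05, 0.05], Q, 5))       # nondecreasing
    print(entropy_path([1 / 3, 1 / 3, 1 / 3], P, 5))   # drops below log2(3)
```

The last line illustrates the counterexample in item 3: starting from the uniform distribution under a chain whose stationary distribution is not uniform, the entropy falls.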

4.5 FUNCTIONS OF MARKOV CHAINS 

Here is an example that can be very difficult if done the wrong
way. It illustrates the power of the techniques developed so far. Let
$X_1, X_2, \ldots, X_n, \ldots$ be a stationary Markov chain, and let $Y_i = \phi(X_i)$ be
a process each term of which is a function of the corresponding state
in the Markov chain. What is the entropy rate $H(\mathcal{Y})$? Such functions of
Markov chains occur often in practice. In many situations, one has only
partial information about the state of the system. It would simplify matters
greatly if $Y_1, Y_2, \ldots, Y_n$ also formed a Markov chain, but in many cases,
this is not true. Since the Markov chain is stationary, so is $Y_1, Y_2, \ldots, Y_n$,
and the entropy rate is well defined. However, if we wish to compute
$H(\mathcal{Y})$, we might compute $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ for each n and find the
limit. Since the convergence can be arbitrarily slow, we will never know
how close we are to the limit. (We can't look at the change between the
values at n and n + 1, since this difference may be small even when we
are far away from the limit; consider, for example, $\sum \frac{1}{n}$.)

It would be useful computationally to have upper and lower bounds
converging to the limit from above and below. We can halt the computation
when the difference between upper and lower bounds is small, and we
will then have a good estimate of the limit.

We already know that $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ converges
monotonically to $H(\mathcal{Y})$ from above. For a lower bound, we will use
$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1)$. This is a neat trick based on the idea that $X_1$
contains as much information about $Y_n$ as $Y_1, Y_0, Y_{-1}, \ldots$.

Lemma 4.5.1

$$ H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(\mathcal{Y}). \tag{4.57} $$

Proof: We have for $k = 1, 2, \ldots$,

\begin{align}
H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) &\stackrel{(a)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, X_1) \tag{4.58} \\
&\stackrel{(b)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1, X_0, X_{-1}, \ldots, X_{-k}) \tag{4.59} \\
&\stackrel{(c)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1, X_0, X_{-1}, \ldots, X_{-k}, Y_0, \ldots, Y_{-k}) \tag{4.60} \\
&\stackrel{(d)}{\le} H(Y_n \mid Y_{n-1}, \ldots, Y_1, Y_0, \ldots, Y_{-k}) \tag{4.61} \\
&\stackrel{(e)}{=} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1), \tag{4.62}
\end{align}

where (a) follows from the fact that $Y_1$ is a function of $X_1$, (b) follows
from the Markovity of X, (c) follows from the fact that $Y_i$ is a function
of $X_i$, (d) follows from the fact that conditioning reduces entropy, and (e)
follows by stationarity. Since the inequality is true for all k, it is true in
the limit. Thus,

\begin{align}
H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) &\le \lim_k H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1) \tag{4.63} \\
&= H(\mathcal{Y}). \quad \Box \tag{4.64}
\end{align}

The next lemma shows that the interval between the upper and the
lower bounds decreases in length.

Lemma 4.5.2

$$ H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0. \tag{4.65} $$

Proof: The interval length can be rewritten as

$$ H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1). \tag{4.66} $$

By the properties of mutual information,

$$ I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1), \tag{4.67} $$

and $I(X_1; Y_1, Y_2, \ldots, Y_n)$ increases with n. Thus, $\lim I(X_1; Y_1, Y_2, \ldots,
Y_n)$ exists and

$$ \lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1). \tag{4.68} $$

By the chain rule,

\begin{align}
H(X_1) &\ge \lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \tag{4.69} \\
&= \lim_{n \to \infty} \sum_{i=1}^n I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1) \tag{4.70} \\
&= \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1). \tag{4.71}
\end{align}

Since this infinite sum is finite and the terms are nonnegative, the terms
must tend to 0; that is,

$$ \lim I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1) = 0, \tag{4.72} $$

which proves the lemma. □

Combining Lemmas 4.5.1 and 4.5.2, we have the following theorem. 

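Theorem 4.5.1 If $X_1, X_2, \ldots, X_n$ form a stationary Markov chain, and
$Y_i = \phi(X_i)$, then

$$ H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1) \tag{4.73} $$

and

$$ \lim H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim H(Y_n \mid Y_{n-1}, \ldots, Y_1). \tag{4.74} $$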
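
The two bounds of the theorem can be computed by brute force for a small chain, which shows how they bracket the entropy rate of the function process. The following is a minimal sketch, assuming logarithms to base 2; the chain `Q` (doubly stochastic, so the uniform distribution is stationary), the map `phi`, and the function names are hypothetical choices for illustration, and the enumeration is exponential in n.

```python
from collections import defaultdict
from itertools import product
from math import log2

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def bounds(P, mu, phi, n):
    """Return (lower, upper), where
    lower = H(Y_n | Y_{n-1}, ..., Y_1, X_1) and upper = H(Y_n | Y_{n-1}, ..., Y_1),
    for Y_i = phi[X_i], by enumerating all state paths of length n."""
    m = len(P)
    y_n, y_prev = defaultdict(float), defaultdict(float)
    xy_n, xy_prev = defaultdict(float), defaultdict(float)
    for path in product(range(m), repeat=n):
        prob = mu[path[0]]
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]
        y = tuple(phi[x] for x in path)
        y_n[y] += prob
        y_prev[y[:-1]] += prob
        xy_n[(path[0], y)] += prob
        xy_prev[(path[0], y[:-1])] += prob
    # Conditional entropy as a difference of joint entropies.
    upper = H(y_n) - H(y_prev)
    lower = H(xy_n) - H(xy_prev)
    return lower, upper

# Hypothetical stationary chain and a function merging states 0 and 1.
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
mu = [1 / 3, 1 / 3, 1 / 3]
phi = [0, 0, 1]

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, bounds(Q, mu, phi, n))
```

In practice one would stop increasing n once the gap between the two bounds falls below the desired tolerance, which is the halting rule suggested in the text.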

In general, we could also consider the case where 7/ is a stochastic 
function (as opposed to a deterministic function) of Xi . Consider a Markov 
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process X\, X 2 ,..., X n , and define a new process Y\,Y 2 ,, Y n , where 
each Yi is drawn according to p(yi\xi), conditionally independent of all 
the other Xj , j ♦ i., that is, 


n— 1 n 

p(x n , y n ) = p(xi) ]""[ p(x i+ i\Xi)Y\ PiyiUi). (4.75) 

i=\ i=l 

Such a process, called a hidden Markov model (HMM), is used extensively 
in speech recognition, handwriting recognition, and so on. The same
argument as that used above for functions of a Markov chain carries over to
hidden Markov models, and we can lower bound the entropy rate of a 
hidden Markov model by conditioning it on the underlying Markov state. 
The details of the argument are left to the reader. 
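
The two bounds of Theorem 4.5.1 can be checked numerically on a small example. The following is a minimal sketch, not part of the text: it enumerates all state sequences of length n by brute force, and the chain `P_EXAMPLE`, the merging function `PHI_EXAMPLE`, and the function names are illustrative assumptions. The stationary distribution is approximated by power iteration, which assumes an irreducible, aperiodic chain.

```python
import itertools
import math
from collections import defaultdict


def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration
    (assumes an irreducible, aperiodic chain)."""
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
    return mu


def entropy(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def entropy_rate_bounds(P, phi, n):
    """Return (H(Y_n | Y_{n-1},...,Y_1, X_1), H(Y_n | Y_{n-1},...,Y_1)) in bits."""
    m = len(P)
    mu = stationary(P)
    joint = defaultdict(float)  # (x_1, (y_1, ..., y_n)) -> probability
    for xs in itertools.product(range(m), repeat=n):
        p = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= P[a][b]
        joint[(xs[0], tuple(phi[x] for x in xs))] += p

    def marginal(keep_x1, length):
        # Marginal of (X_1, Y_1..Y_length) or of (Y_1..Y_length) alone.
        out = defaultdict(float)
        for (x1, ys), p in joint.items():
            out[(x1 if keep_x1 else None, ys[:length])] += p
        return out

    # Conditional entropies written as differences of joint entropies.
    lower = entropy(marginal(True, n)) - entropy(marginal(True, n - 1))
    upper = entropy(marginal(False, n)) - entropy(marginal(False, n - 1))
    return lower, upper


# Hypothetical three-state chain; phi merges states 0 and 1.
P_EXAMPLE = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
PHI_EXAMPLE = [0, 0, 1]

if __name__ == "__main__":
    for n in range(2, 8):
        lo, hi = entropy_rate_bounds(P_EXAMPLE, PHI_EXAMPLE, n)
        print(n, round(lo, 6), round(hi, 6))
```

The printed pairs bracket the entropy rate of the Y process, and the gap between them is the mutual information of (4.66). The enumeration costs m^n operations, so this is only practical for small n.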


SUMMARY 

Entropy rate. Two definitions of entropy rate for a stochastic process
are

$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n),$ (4.76)

$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1).$ (4.77)

For a stationary stochastic process,

$H(\mathcal{X}) = H'(\mathcal{X}).$ (4.78)

Entropy rate of a stationary Markov chain

$H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}.$ (4.79)

Second law of thermodynamics. For a Markov chain:

1. Relative entropy $D(\mu_n \| \mu'_n)$ decreases with time.

2. Relative entropy $D(\mu_n \| \mu)$ between a distribution and the stationary
distribution decreases with time.

3. Entropy $H(X_n)$ increases if the stationary distribution is uniform.

4. The conditional entropy $H(X_n \mid X_1)$ increases with time for a
stationary Markov chain.

5. The conditional entropy $H(X_0 \mid X_n)$ of the initial condition $X_0$
increases for any Markov chain.

Functions of a Markov chain. If $X_1, X_2, \ldots, X_n$ form a stationary
Markov chain and $Y_i = \phi(X_i)$, then

$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ (4.80)

and

$\lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1).$ (4.81)
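
The Markov chain formula (4.79) is easy to evaluate directly. The sketch below is an illustration, not the book's code: the function names are assumptions, and the stationary distribution is approximated by power iteration, which assumes the chain is irreducible and aperiodic.

```python
import math


def stationary(P, iters=5000):
    """Approximate the stationary distribution mu (with mu P = mu) by power
    iteration; assumes an irreducible, aperiodic chain."""
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
    return mu


def markov_entropy_rate(P):
    """Entropy rate -sum_ij mu_i P_ij log2 P_ij in bits per symbol (eq. 4.79)."""
    mu = stationary(P)
    return -sum(
        mu[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0  # terms with P_ij = 0 contribute nothing
    )


if __name__ == "__main__":
    # Fair coin flips viewed as a two-state chain.
    print(markov_entropy_rate([[0.5, 0.5], [0.5, 0.5]]))
    # A chain in which state 1 always returns to state 0.
    print(markov_entropy_rate([[0.5, 0.5], [1.0, 0.0]]))
```

For the second chain the stationary distribution is (2/3, 1/3) and only state 0 generates randomness, so the rate is 2/3 of a bit per symbol.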


PROBLEMS 


4.1 Doubly stochastic matrices. An n × n matrix $P = [P_{ij}]$ is said
to be doubly stochastic if $P_{ij} \ge 0$ and $\sum_j P_{ij} = 1$ for all i and
$\sum_i P_{ij} = 1$ for all j. An n × n matrix P is said to be a permutation
matrix if it is doubly stochastic and there is precisely one
$P_{ij} = 1$ in each row and each column. It can be shown that every
doubly stochastic matrix can be written as the convex combination
of permutation matrices.

(a) Let $a^t = (a_1, a_2, \ldots, a_n)$, $a_i \ge 0$, $\sum a_i = 1$, be a probability
vector. Let b = aP, where P is doubly stochastic. Show that b
is a probability vector and that $H(b_1, b_2, \ldots, b_n) \ge H(a_1, a_2, \ldots, a_n)$.
Thus, stochastic mixing increases entropy.

(b) Show that a stationary distribution $\mu$ for a doubly stochastic
matrix P is the uniform distribution.

(c) Conversely, prove that if the uniform distribution is a stationary 
distribution for a Markov transition matrix P, then P is doubly 
stochastic. 

4.2 Time's arrow. Let $\{X_i\}_{i=-\infty}^{\infty}$ be a stationary stochastic process.
Prove that

$H(X_0 \mid X_{-1}, X_{-2}, \ldots, X_{-n}) = H(X_0 \mid X_1, X_2, \ldots, X_n).$

In other words, the present has a conditional entropy given the past
equal to the conditional entropy given the future. This is true even 
though it is quite easy to concoct stationary random processes for 
which the flow into the future looks quite different from the flow 
into the past. That is, one can determine the direction of time by 
looking at a sample function of the process. Nonetheless, given 
the present state, the conditional uncertainty of the next symbol in 
the future is equal to the conditional uncertainty of the previous 
symbol in the past. 

4.3 Shuffles increase entropy. Argue that for any distribution on
shuffles T and any distribution on card positions X that

$H(TX) \ge H(TX \mid T)$ (4.82)

$= H(T^{-1}TX \mid T)$ (4.83)

$= H(X \mid T)$ (4.84)

$= H(X)$ (4.85)

if X and T are independent.

4.4 Second law of thermodynamics. Let $X_1, X_2, X_3, \ldots$ be a stationary
first-order Markov chain. In Section 4.4 it was shown that
$H(X_n \mid X_1) \ge H(X_{n-1} \mid X_1)$ for n = 2, 3, .... Thus, conditional
uncertainty about the future grows with time. This is true although
the unconditional uncertainty $H(X_n)$ remains constant. However,
show by example that $H(X_n \mid X_1 = x_1)$ does not necessarily grow
with n for every $x_1$.

4.5 Entropy of a random tree. Consider the following method of
generating a random tree with n nodes. First expand the root node:



Then expand one of the two terminal nodes at random: 



At time k, choose one of the k − 1 terminal nodes according to a
uniform distribution and expand it. Continue until n terminal nodes
have been generated. Thus, a sequence leading to a five-node tree
might look like this:



Surprisingly, the following method of generating random trees 
yields the same probability distribution on trees with n terminal
nodes. First choose an integer $N_1$ uniformly distributed on
{1, 2, ..., n − 1}. We then have the picture

[Figure: a root whose two subtrees contain $N_1$ and $n - N_1$ terminal nodes.]

Then choose an integer $N_2$ uniformly distributed over
{1, 2, ..., $N_1$ − 1}, and independently choose another integer $N_3$
uniformly over {1, 2, ..., ($n - N_1$) − 1}. The picture is now

[Figure: four subtrees containing $N_2$, $N_1 - N_2$, $N_3$, and $n - N_1 - N_3$ terminal nodes.]


Continue the process until no further subdivision can be made. 
(The equivalence of these two tree generation schemes follows, for 
example, from Polya’s urn model.) 

Now let $T_n$ denote a random n-node tree generated as described. The
probability distribution on such trees seems difficult to describe, but
we can find the entropy of this distribution in recursive form.

First some examples. For n = 2, we have only one tree. Thus,
$H(T_2) = 0$. For n = 3, we have two equally probable trees.
Thus, $H(T_3) = \log 2$. For n = 4, we have five possible trees, with
probabilities 1/3, 1/6, 1/6, 1/6, 1/6.


Now for the recurrence relation. Let $N_1(T_n)$ denote the number of
terminal nodes of $T_n$ in the right half of the tree. Justify each of
the steps in the following:

$H(T_n) \overset{(a)}{=} H(N_1, T_n)$ (4.86)

$\overset{(b)}{=} H(N_1) + H(T_n \mid N_1)$ (4.87)

$\overset{(c)}{=} \log(n - 1) + H(T_n \mid N_1)$ (4.88)

$\overset{(d)}{=} \log(n - 1) + \frac{1}{n-1} \sum_{k=1}^{n-1} \left( H(T_k) + H(T_{n-k}) \right)$ (4.89)

$\overset{(e)}{=} \log(n - 1) + \frac{2}{n-1} \sum_{k=1}^{n-1} H(T_k)$ (4.90)

$= \log(n - 1) + \frac{2}{n-1} \sum_{k=1}^{n-1} H_k.$ (4.91)

(f) Use this to show that

$(n - 1) H_n = n H_{n-1} + (n - 1) \log(n - 1) - (n - 2) \log(n - 2)$ (4.92)

or

$\frac{H_n}{n} = \frac{H_{n-1}}{n-1} + c_n$ (4.93)

for appropriately defined $c_n$. Since $\sum c_n = c < \infty$, you have proved
that $\frac{1}{n} H(T_n)$ converges to a constant. Thus, the expected number of
bits necessary to describe the random tree $T_n$ grows linearly with n.
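
As a numerical sanity check on the recurrence (4.91), the sketch below tabulates $H_n = H(T_n)$ in bits and looks at how $H_n / n$ behaves for large n. The function name `tree_entropies` is an assumption made for illustration; the code simply iterates the recurrence with $H_1 = 0$ and does not replace the justification the problem asks for.

```python
import math


def tree_entropies(n_max):
    """Return a list H with H[n] = H(T_n) in bits for n = 1..n_max,
    computed from H_n = log2(n-1) + (2/(n-1)) * sum_{k=1}^{n-1} H_k."""
    H = [0.0] * (n_max + 1)  # H[0] is unused; H[1] = 0
    running = 0.0            # running value of sum_{k=1}^{n-1} H[k]
    for n in range(2, n_max + 1):
        running += H[n - 1]
        H[n] = math.log2(n - 1) + 2.0 * running / (n - 1)
    return H


if __name__ == "__main__":
    H = tree_entropies(2000)
    print(H[2], H[3], H[4])      # small cases from the problem statement
    for n in (10, 100, 1000, 2000):
        print(n, H[n] / n)       # H_n / n settles toward a constant
```

The value for n = 4 agrees with the entropy of the distribution (1/3, 1/6, 1/6, 1/6, 1/6), and the per-node entropy changes only slowly between n = 1000 and n = 2000, consistent with linear growth of $H(T_n)$.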

4.6 Monotonicity of entropy per element. For a stationary stochastic 
process $X_1, X_2, \ldots, X_n$, show that

(a)

$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{H(X_1, X_2, \ldots, X_{n-1})}{n - 1}.$ (4.94)

(b)

$\frac{H(X_1, X_2, \ldots, X_n)}{n} \ge H(X_n \mid X_{n-1}, \ldots, X_1).$ (4.95)

4.7 Entropy rates of Markov chains

(a) Find the entropy rate of the two-state Markov chain with
transition matrix

$P = \begin{bmatrix} 1 - p_{01} & p_{01} \\ p_{10} & 1 - p_{10} \end{bmatrix}.$

(b) What values of $p_{01}$, $p_{10}$ maximize the entropy rate?

(c) Find the entropy rate of the two-state Markov chain with
transition matrix

$P = \begin{bmatrix} 1 - p & p \\ 1 & 0 \end{bmatrix}.$

(d) Find the maximum value of the entropy rate of the Markov
chain of part (c). We expect that the maximizing value of p
should be less than $\frac{1}{2}$, since the 0 state permits more
information to be generated than the 1 state.

(e) Let N(t) be the number of allowable state sequences of length t
for the Markov chain of part (c). Find N(t) and calculate

$H_0 = \lim_{t \to \infty} \frac{1}{t} \log N(t).$

[Hint: Find a linear recurrence that expresses N(t) in terms
of N(t − 1) and N(t − 2). Why is $H_0$ an upper bound on the
entropy rate of the Markov chain? Compare $H_0$ with the
maximum entropy found in part (d).]

4.8 Maximum entropy process. A discrete memoryless source has the
alphabet {1, 2}, where the symbol 1 has duration 1 and the symbol
2 has duration 2. The probabilities of 1 and 2 are $p_1$ and $p_2$,
respectively. Find the value of $p_1$ that maximizes the source entropy
per unit time $H(\mathcal{X}) = H(X) / E\,T$. What is the maximum value $H(\mathcal{X})$?

4.9 Initial conditions. Show, for a Markov chain, that

$H(X_0 \mid X_n) \ge H(X_0 \mid X_{n-1}).$

Thus, initial conditions $X_0$ become more difficult to recover as the
future $X_n$ unfolds.

4.10 Pairwise independence. Let $X_1, X_2, \ldots, X_{n-1}$ be i.i.d. random
variables taking values in {0, 1}, with $\Pr\{X_i = 1\} = \frac{1}{2}$. Let $X_n = 1$
if $\sum_{i=1}^{n-1} X_i$ is odd and $X_n = 0$ otherwise. Let $n \ge 3$.

(a) Show that $X_i$ and $X_j$ are independent for $i \ne j$, $i, j \in \{1, 2, \ldots, n\}$.

(b) Find $H(X_i, X_j)$ for $i \ne j$.

(c) Find $H(X_1, X_2, \ldots, X_n)$. Is this equal to $nH(X_1)$?

4.11 Stationary processes. Let $\ldots, X_{-1}, X_0, X_1, \ldots$ be a stationary
(not necessarily Markov) stochastic process. Which of the following
statements are true? Prove or provide a counterexample.

(a) $H(X_n \mid X_0) = H(X_{-n} \mid X_0)$.

(b) $H(X_n \mid X_0) \ge H(X_{n-1} \mid X_0)$.

(c) $H(X_n \mid X_1, X_2, \ldots, X_{n-1}, X_{n+1})$ is nonincreasing in n.

(d) $H(X_n \mid X_1, \ldots, X_{n-1}, X_{n+1}, \ldots, X_{2n})$ is nonincreasing in n.

4.12 Entropy rate of a dog looking for a bone. A dog walks on the 
integers, possibly reversing direction at each step with probability 
p = 0.1. Let $X_0 = 0$. The first step is equally likely to be positive
or negative. A typical walk might look like this:

$(X_0, X_1, \ldots) = (0, -1, -2, -3, -4, -3, -2, -1, 0, 1, \ldots).$

(a) Find $H(X_1, X_2, \ldots, X_n)$.

(b) Find the entropy rate of the dog. 

(c) What is the expected number of steps that the dog takes before 
reversing direction? 

4.13 The past has little to say about the future. For a stationary
stochastic process $X_1, X_2, \ldots, X_n, \ldots$, show that

$\lim_{n \to \infty} \frac{1}{2n} I(X_1, X_2, \ldots, X_n; X_{n+1}, X_{n+2}, \ldots, X_{2n}) = 0.$ (4.96)

Thus, the dependence between adjacent n-blocks of a stationary
process does not grow linearly with n.

4.14 Functions of a stochastic process 

(a) Consider a stationary stochastic process $X_1, X_2, \ldots, X_n$, and
let $Y_1, Y_2, \ldots, Y_n$ be defined by

$Y_i = \phi(X_i), \quad i = 1, 2, \ldots$ (4.97)

for some function $\phi$. Prove that

$H(\mathcal{Y}) \le H(\mathcal{X}).$ (4.98)

(b) What is the relationship between the entropy rates $H(\mathcal{Z})$ and
$H(\mathcal{X})$ if

$Z_i = \psi(X_i, X_{i+1}), \quad i = 1, 2, \ldots$ (4.99)

for some function $\psi$?

4.15 Entropy rate. Let $\{X_i\}$ be a discrete stationary stochastic process
with entropy rate $H(\mathcal{X})$. Show that

$\frac{1}{n} H(X_n, \ldots, X_1 \mid X_0, X_{-1}, \ldots, X_{-k}) \to H(\mathcal{X})$ (4.100)

for k = 1, 2, ....

4.16 Entropy rate of constrained sequences. In magnetic recording, the
mechanism of recording and reading the bits imposes constraints
on the sequences of bits that can be recorded. For example, to
ensure proper synchronization, it is often necessary to limit the
length of runs of 0's between two 1's. Also, to reduce intersymbol
interference, it may be necessary to require at least one 0 between
any two 1's. We consider a simple example of such a constraint.
Suppose that we are required to have at least one 0 and at most
two 0's between any pair of 1's in a sequence. Thus, sequences
like 101001 and 0101001 are valid sequences, but 0110010 and
0000101 are not. We wish to calculate the number of valid
sequences of length n.

(a) Show that the set of constrained sequences is the same as the 
set of allowed paths on the following state diagram: 



(b) Let $X_i(n)$ be the number of valid paths of length n ending at
state i. Argue that $X(n) = [X_1(n)\ X_2(n)\ X_3(n)]^t$ satisfies the
following recursion:

$\begin{bmatrix} X_1(n) \\ X_2(n) \\ X_3(n) \end{bmatrix} = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_1(n-1) \\ X_2(n-1) \\ X_3(n-1) \end{bmatrix},$ (4.101)

with initial conditions $X(1) = [1\ 1\ 0]^t$.

(c) Let

$A = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$ (4.102)

Then we have by induction

$X(n) = AX(n-1) = A^2 X(n-2) = \cdots = A^{n-1} X(1).$ (4.103)

Using the eigenvalue decomposition of A for the case of distinct
eigenvalues, we can write $A = U^{-1} \Lambda U$, where $\Lambda$ is the diagonal
matrix of eigenvalues. Then $A^{n-1} = U^{-1} \Lambda^{n-1} U$. Show
that we can write

$X(n) = \lambda_1^{n-1} Y_1 + \lambda_2^{n-1} Y_2 + \lambda_3^{n-1} Y_3,$ (4.104)

where $Y_1, Y_2, Y_3$ do not depend on n. For large n, this sum
is dominated by the largest term. Therefore, argue that for i =
1, 2, 3, we have

$\frac{1}{n} \log X_i(n) \to \log \lambda,$ (4.105)

where $\lambda$ is the largest (positive) eigenvalue. Thus, the number
of sequences of length n grows as $\lambda^n$ for large n. Calculate $\lambda$
for the matrix A above. (The case when the eigenvalues are
not distinct can be handled in a similar manner.)

(d) We now take a different approach. Consider a Markov chain
whose state diagram is the one given in part (a), but with
arbitrary transition probabilities. Therefore, the probability
transition matrix of this Markov chain is

$P = \begin{bmatrix} 0 & 1 & 0 \\ \alpha & 0 & 1 - \alpha \\ 1 & 0 & 0 \end{bmatrix}.$ (4.106)

Show that the stationary distribution of this Markov chain is

$\mu = \left[ \frac{1}{3 - \alpha},\ \frac{1}{3 - \alpha},\ \frac{1 - \alpha}{3 - \alpha} \right].$ (4.107)

(e) Maximize the entropy rate of the Markov chain over choices
of $\alpha$. What is the maximum entropy rate of the chain?

(f) Compare the maximum entropy rate in part (e) with $\log \lambda$ in
part (c). Why are the two answers the same?
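
The two approaches of parts (c) and (e) can be compared numerically. The sketch below is an illustration under stated assumptions: it takes the matrix A and initial condition X(1) = [1, 1, 0] from part (b), finds the largest eigenvalue as the real root of the characteristic polynomial $\lambda^3 - \lambda - 1$ of A by bisection, and maximizes the entropy rate $H(\alpha)/(3 - \alpha)$ (the stationary weight of state 2 times the binary entropy of its outgoing transitions) by a simple grid search. All function names are hypothetical.

```python
import math

A = [[0, 1, 1], [1, 0, 0], [0, 1, 0]]


def count_paths(n):
    """X(n) = A^(n-1) X(1) with X(1) = [1, 1, 0], using exact integers."""
    x = [1, 1, 0]
    for _ in range(n - 1):
        x = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]
    return x


def largest_eigenvalue(tol=1e-12):
    """Real root of det(lambda*I - A) = lambda^3 - lambda - 1 on [1, 2]."""
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** 3 - mid - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def binary_entropy(a):
    """H(a) in bits, with H(0) = H(1) = 0."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def entropy_rate(alpha):
    """Entropy rate of the chain in (4.106): mu_2 * H(alpha)."""
    return binary_entropy(alpha) / (3 - alpha)


def max_entropy_rate(steps=100000):
    """Grid search over alpha in (0, 1); a crude stand-in for calculus."""
    return max(entropy_rate(k / steps) for k in range(1, steps))


if __name__ == "__main__":
    lam = largest_eigenvalue()
    print("log2(lambda)      :", math.log2(lam))
    print("max entropy rate  :", max_entropy_rate())
    n = 400
    print("(1/n) log2 count  :", math.log2(sum(count_paths(n))) / n)
```

The three printed numbers should agree closely, which is the observation part (f) asks the reader to explain.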

4.17 Recurrence times are insensitive to distributions. Let $X_0, X_1, X_2, \ldots$
be drawn i.i.d. $\sim p(x)$, $x \in \mathcal{X} = \{1, 2, \ldots, m\}$, and let N be the
waiting time to the next occurrence of $X_0$. Thus $N = \min_n \{X_n = X_0\}$.

(a) Show that $EN = m$.

(b) Show that $E \log N \le H(X)$.

(c) (Optional) Prove part (a) for $\{X_i\}$ stationary and ergodic.

4.18 Stationary but not ergodic process. A bin has two biased coins,
one with probability of heads p and the other with probability of
heads 1 − p. One of these coins is chosen at random (i.e., with
probability $\frac{1}{2}$) and is then tossed n times. Let X denote the identity
of the coin that is picked, and let $Y_1$ and $Y_2$ denote the results of
the first two tosses.

(a) Calculate $I(Y_1; Y_2 \mid X)$.

(b) Calculate $I(X; Y_1, Y_2)$.

(c) Let $H(\mathcal{Y})$ be the entropy rate of the Y process (the
sequence of coin tosses). Calculate $H(\mathcal{Y})$. [Hint: Relate this to
$\lim_{n \to \infty} \frac{1}{n} H(X, Y_1, Y_2, \ldots, Y_n)$.]

You can check the answer by considering the behavior as $p \to \frac{1}{2}$.

4.19 Random walk on graph. Consider a random walk on the following
graph:

[Figure: the graph for this random walk.]

(a) Calculate the stationary distribution.

(b) What is the entropy rate?

(c) Find the mutual information $I(X_{n+1}; X_n)$ assuming that the
process is stationary.

4.20 Random walk on chessboard. Find the entropy rate of the Markov
chain associated with a random walk of a king on the 3 × 3
chessboard

    1  2  3
    4  5  6
    7  8  9


What about the entropy rate of rooks, bishops, and queens? There 
are two types of bishops. 

4.21 Maximal entropy graphs. Consider a random walk on a connected
graph with four edges. 

(a) Which graph has the highest entropy rate? 

(b) Which graph has the lowest? 

4.22 Three-dimensional maze. A bird is lost in a 3 x 3 x 3 cubical 
maze. The bird flies from room to room going to adjoining rooms 
with equal probability through each of the walls. For example, the 
corner rooms have three exits. 

(a) What is the stationary distribution? 

(b) What is the entropy rate of this random walk? 

4.23 Entropy rate. Let $\{X_i\}$ be a stationary stochastic process with
entropy rate $H(\mathcal{X})$.

(a) Argue that $H(\mathcal{X}) \le H(X_1)$.

(b) What are the conditions for equality? 

4.24 Entropy rates. Let $\{X_i\}$ be a stationary process. Let $Y_i = (X_i, X_{i+1})$.
Let $Z_i = (X_{2i}, X_{2i+1})$. Let $V_i = X_{2i}$. Consider the entropy
rates $H(\mathcal{X})$, $H(\mathcal{Y})$, $H(\mathcal{Z})$, and $H(\mathcal{V})$ of the processes $\{X_i\}$, $\{Y_i\}$,
$\{Z_i\}$, and $\{V_i\}$. What is the inequality relationship ≤, =, or ≥
between each of the pairs listed below?

(a)

(b) $H(\mathcal{X}) \lessgtr H(\mathcal{Z})$.

(c) 

(d) 

4.25 Monotonicity 

(a) Show that $I(X; Y_1, Y_2, \ldots, Y_n)$ is nondecreasing in n.

(b) Under what conditions is the mutual information constant for
all n?

4.26 Transitions in Markov chains. Suppose that $\{X_i\}$ forms an
irreducible Markov chain with transition matrix P and stationary
distribution $\mu$. Form the associated "edge process" $\{Y_i\}$ by keeping track
only of the transitions. Thus, the new process $\{Y_i\}$ takes values in
$\mathcal{X} \times \mathcal{X}$, and $Y_i = (X_{i-1}, X_i)$. For example,

$X^n = 3, 2, 8, 5, 7, \ldots$

becomes

$Y^n = (\emptyset, 3), (3, 2), (2, 8), (8, 5), (5, 7), \ldots$

Find the entropy rate of the edge process {Yi}. 

4.27 Entropy rate. Let $\{X_i\}$ be a stationary {0, 1}-valued stochastic
process obeying

$X_{k+1} = X_k \oplus X_{k-1} \oplus Z_{k+1},$

where $\{Z_i\}$ is Bernoulli(p) and $\oplus$ denotes mod 2 addition. What is
the entropy rate $H(\mathcal{X})$?

4.28 Mixture of processes. Suppose that we observe one of two
stochastic processes but don't know which. What is the entropy
rate? Specifically, let $X_{11}, X_{12}, X_{13}, \ldots$ be a Bernoulli process with
parameter $p_1$, and let $X_{21}, X_{22}, X_{23}, \ldots$ be Bernoulli($p_2$). Let

$\theta = 1$ with probability $\frac{1}{2}$,
$\theta = 2$ with probability $\frac{1}{2}$,

and let $Y_i = X_{\theta i}$, i = 1, 2, ..., be the stochastic process observed.
Thus, Y observes the process $\{X_{1i}\}$ or $\{X_{2i}\}$. Eventually, Y will
know which.

(a) Is $\{Y_i\}$ stationary?

(b) Is $\{Y_i\}$ an i.i.d. process?

(c) What is the entropy rate H of $\{Y_i\}$?

(d) Does

$-\frac{1}{n} \log p(Y_1, Y_2, \ldots, Y_n) \to H$?

(e) Is there a code that achieves an expected per-symbol description
length $\frac{1}{n} E L_n \to H$?

Now let $\theta_i$ be Bern($\frac{1}{2}$). Observe that

$Z_i = X_{\theta_i i}, \quad i = 1, 2, \ldots$

Thus, $\theta$ is not fixed for all time, as it was in the first part, but is
chosen i.i.d. each time. Answer parts (a), (b), (c), (d), (e) for the
process $\{Z_i\}$, labeling the answers (a'), (b'), (c'), (d'), (e').

4.29 Waiting times. Let X be the waiting time for the first heads to
appear in successive flips of a fair coin. For example, $\Pr\{X = 3\} = (\frac{1}{2})^3$.
Let $S_n$ be the waiting time for the nth head to appear. Thus,

$S_0 = 0,$
$S_{n+1} = S_n + X_{n+1},$

where $X_1, X_2, X_3, \ldots$ are i.i.d. according to the distribution above.

(a) Is the process $\{S_n\}$ stationary?

(b) Calculate $H(S_1, S_2, \ldots, S_n)$.

(c) Does the process $\{S_n\}$ have an entropy rate? If so, what is it?
If not, why not?

(d) What is the expected number of fair coin flips required to
generate a random variable having the same distribution as $S_n$?

4.30 Markov chain transitions

$P = [P_{ij}] = \begin{bmatrix} 1/2 & 1/4 & 1/4 \\ 1/4 & 1/2 & 1/4 \\ 1/4 & 1/4 & 1/2 \end{bmatrix}.$

Let $X_1$ be distributed uniformly over the states {0, 1, 2}. Let $\{X_i\}_{1}^{\infty}$
be a Markov chain with transition matrix P; thus, $P(X_{n+1} = j \mid X_n = i) = P_{ij}$,
$i, j \in \{0, 1, 2\}$.

(a) Is $\{X_n\}$ stationary?

(b) Find $\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$.

Now consider the derived process $Z_1, Z_2, \ldots, Z_n$, where

$Z_1 = X_1,$
$Z_i = X_i - X_{i-1} \pmod 3, \quad i = 2, \ldots, n.$

Thus, $Z^n$ encodes the transitions, not the states.

(c) Find $H(Z_1, Z_2, \ldots, Z_n)$.

(d) Find $H(Z_n)$ and $H(X_n)$ for $n \ge 2$.

(e) Find $H(Z_n \mid Z_{n-1})$ for $n \ge 2$.

(f) Are $Z_{n-1}$ and $Z_n$ independent for $n \ge 2$?

4.31 Markov. Let $\{X_i\} \sim$ Bernoulli(p). Consider the associated
Markov chain $\{Y_i\}_{i=1}^{n}$, where

$Y_i$ = (the number of 1's in the current run of 1's). For example, if
$X^n = 101110\ldots$, we have $Y^n = 101230\ldots$.

(a) Find the entropy rate of $X^n$.

(b) Find the entropy rate of $Y^n$.

4.32 Time symmetry. Let $\{X_n\}$ be a stationary Markov process. We
condition on $(X_0, X_1)$ and look into the past and future. For what
index k is

$H(X_{-n} \mid X_0, X_1) = H(X_k \mid X_0, X_1)$?

Give the argument.

4.33 Chain inequality. Let $X_1 \to X_2 \to X_3 \to X_4$ form a Markov
chain. Show that

$I(X_1; X_3) + I(X_2; X_4) \le I(X_1; X_4) + I(X_2; X_3).$ (4.108)

4.34 Broadcast channel. Let $X \to Y \to (Z, W)$ form a Markov chain
[i.e., $p(x, y, z, w) = p(x) p(y \mid x) p(z, w \mid y)$ for all x, y, z, w]. Show
that

$I(X; Z) + I(X; W) \le I(X; Y) + I(Z; W).$ (4.109)

4.35 Concavity of second law. Let $\{X_n\}_{-\infty}^{\infty}$ be a stationary Markov
process. Show that $H(X_n \mid X_0)$ is concave in n. Specifically, show
that

$H(X_n \mid X_0) - H(X_{n-1} \mid X_0) - \left( H(X_{n-1} \mid X_0) - H(X_{n-2} \mid X_0) \right)$
$= -I(X_1; X_{n-1} \mid X_0, X_n) \le 0.$ (4.110)

Thus, the second difference is negative, establishing that $H(X_n \mid X_0)$
is a concave function of n.

HISTORICAL NOTES 

The entropy rate of a stochastic process was introduced by Shannon [472], 
who also explored some of the connections between the entropy rate of the 
process and the number of possible sequences generated by the process. 
Since Shannon, there have been a number of results extending the basic
theorems of information theory to general stochastic processes. The AEP
for a general stationary stochastic process is proved in Chapter 16.

Hidden Markov models are used for a number of applications, such
as speech recognition [432]. The calculation of the entropy rate for
constrained sequences was introduced by Shannon [472]. These sequences
are used for coding for magnetic and optical channels [288].



CHAPTER 5 


DATA COMPRESSION 


We now put content in the definition of entropy by establishing the
fundamental limit for the compression of information. Data compression can be
achieved by assigning short descriptions to the most frequent outcomes
of the data source, and necessarily longer descriptions to the less
frequent outcomes. For example, in Morse code, the most frequent symbol
is represented by a single dot. In this chapter we find the shortest average
description length of a random variable.

We first define the notion of an instantaneous code and then prove the
important Kraft inequality, which asserts that the exponentiated codeword
length assignments must look like a probability mass function.
Elementary calculus then shows that the expected description length must be
greater than or equal to the entropy, the first main result. Then
Shannon's simple construction shows that the expected description length can
achieve this bound asymptotically for repeated descriptions. This
establishes the entropy as a natural measure of efficient description length.
The famous Huffman coding procedure for finding minimum expected
description length assignments is provided. Finally, we show that
Huffman codes are competitively optimal and that it requires roughly H fair
coin flips to generate a sample of a random variable having entropy H.
Thus, the entropy is the data compression limit as well as the number of
bits needed in random number generation, and codes achieving H turn
out to be optimal from many points of view.


5.1 EXAMPLES OF CODES 

Definition A source code C for a random variable X is a mapping from
$\mathcal{X}$, the range of X, to $\mathcal{D}^*$, the set of finite-length strings of symbols from
a D-ary alphabet. Let C(x) denote the codeword corresponding to x and
let l(x) denote the length of C(x).

For example, C(red) = 00, C(blue) = 11 is a source code for $\mathcal{X}$ = {red,
blue} with alphabet $\mathcal{D}$ = {0, 1}.

Definition The expected length L(C) of a source code C(x) for a
random variable X with probability mass function p(x) is given by

$L(C) = \sum_{x \in \mathcal{X}} p(x) l(x),$ (5.1)

where l(x) is the length of the codeword associated with x.

Without loss of generality, we can assume that the D-ary alphabet is
$\mathcal{D} = \{0, 1, \ldots, D - 1\}$.

Some examples of codes follow. 

Example 5.1.1 Let X be a random variable with the following
distribution and codeword assignment:

Pr(X = 1) = 1/2, codeword C(1) = 0
Pr(X = 2) = 1/4, codeword C(2) = 10
Pr(X = 3) = 1/8, codeword C(3) = 110
Pr(X = 4) = 1/8, codeword C(4) = 111. (5.2)

The entropy H(X) of X is 1.75 bits, and the expected length L(C) =
El(X) of this code is also 1.75 bits. Here we have a code that has the
same average length as the entropy. We note that any sequence of bits
can be uniquely decoded into a sequence of symbols of X. For example,
the bit string 0110111100110 is decoded as 134213.
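
The numbers in this example can be reproduced with a few lines of code. The following is a minimal sketch for illustration; the dictionaries `CODE` and `PROB` transcribe the example, while the function names and the greedy left-to-right decoder are assumptions that rely on the code being prefix-free.

```python
import math

# Codeword assignment and distribution of Example 5.1.1.
CODE = {1: "0", 2: "10", 3: "110", 4: "111"}
PROB = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}


def encode(symbols, code):
    """Concatenate the codewords of a sequence of source symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode left to right; valid because no codeword is a prefix of another."""
    inverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:       # end of a codeword reached
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return out


def entropy(prob):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in prob.values() if p > 0)


def expected_length(prob, code):
    """L(C) = sum_x p(x) l(x)."""
    return sum(prob[x] * len(code[x]) for x in prob)


if __name__ == "__main__":
    print(decode("0110111100110", CODE))
    print(entropy(PROB), expected_length(PROB, CODE))
```

Both the entropy and the expected length evaluate to 1.75 bits, and the bit string from the example decodes to the symbols 1, 3, 4, 2, 1, 3.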

Example 5.1.2 Consider another simple example of a code for a random
variable:

Pr(X = 1) = 1/3, codeword C(1) = 0
Pr(X = 2) = 1/3, codeword C(2) = 10
Pr(X = 3) = 1/3, codeword C(3) = 11. (5.3)

Just as in Example 5.1.1, the code is uniquely decodable. However, in
this case the entropy is log 3 = 1.58 bits and the average length of the
encoding is 1.66 bits. Here El(X) > H(X).


Example 5.1.3 (Morse code) The Morse code is a reasonably efficient
code for the English alphabet using an alphabet of four symbols: a dot,
a dash, a letter space, and a word space. Short sequences represent
frequent letters (e.g., a single dot represents E) and long sequences represent
infrequent letters (e.g., Q is represented by "dash, dash, dot, dash"). This is
not the optimal representation for the alphabet in four symbols; in fact,
many possible codewords are not utilized because the codewords for
letters do not contain spaces except for a letter space at the end of every
codeword, and no space can follow another space. It is an interesting
problem to calculate the number of sequences that can be constructed under
these constraints. The problem was solved by Shannon in his original
1948 paper. The problem is also related to coding for magnetic recording,
where long strings of 0's are prohibited [5], [370].

We now define increasingly more stringent conditions on codes. Let $x^n$
denote $(x_1, x_2, \ldots, x_n)$.

Definition A code is said to be nonsingular if every element of the
range of X maps into a different string in $\mathcal{D}^*$; that is,

$x \ne x' \Rightarrow C(x) \ne C(x').$ (5.4)

Nonsingularity suffices for an unambiguous description of a single
value of X. But we usually wish to send a sequence of values of X.
In such cases we can ensure decodability by adding a special symbol
(a "comma") between any two codewords. But this is an inefficient use
of the special symbol; we can do better by developing the idea of
self-punctuating or instantaneous codes. Motivated by the necessity to send
sequences of symbols X, we define the extension of a code as follows:

Definition The extension $C^*$ of a code C is the mapping from
finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$, defined by

$C(x_1 x_2 \cdots x_n) = C(x_1) C(x_2) \cdots C(x_n),$ (5.5)

where $C(x_1) C(x_2) \cdots C(x_n)$ indicates concatenation of the corresponding
codewords.

Example 5.1.4 If $C(x_1) = 00$ and $C(x_2) = 11$, then $C(x_1 x_2) = 0011$.

Definition A code is called uniquely decodable if its extension is
nonsingular.

In other words, any encoded string in a uniquely decodable code has
only one possible source string producing it. However, one may have
to look at the entire string to determine even the first symbol in the
corresponding source string.

Definition A code is called a prefix code or an instantaneous code if
no codeword is a prefix of any other codeword.

An instantaneous code can be decoded without reference to future
codewords since the end of a codeword is immediately recognizable. Hence,
for an instantaneous code, the symbol $x_i$ can be decoded as soon as we
come to the end of the codeword corresponding to it. We need not wait
to see the codewords that come later. An instantaneous code is a
self-punctuating code; we can look down the sequence of code symbols and
add the commas to separate the codewords without looking at later
symbols. For example, the binary string 01011111010 produced by the code
of Example 5.1.1 is parsed as 0,10,111,110,10.

The nesting of these definitions is shown in Figure 5.1. To illustrate the 
differences between the various kinds of codes, consider the examples of 
codeword assignments C(x) to $x \in \mathcal{X}$ in Table 5.1. For the nonsingular
code, the code string 010 has three possible source sequences: 2 or 14 or 
31, and hence the code is not uniquely decodable. The uniquely decodable 
code is not prefix-free and hence is not instantaneous. To see that it is 
uniquely decodable, take any code string and start from the beginning. 
If the first two bits are 00 or 10, they can be decoded immediately. If 



FIGURE 5.1. Classes of codes. 









TABLE 5.1 Classes of Codes

X   Singular   Nonsingular, But Not    Uniquely Decodable,      Instantaneous
               Uniquely Decodable      But Not Instantaneous
1   0          0                       10                       0
2   0          010                     00                       10
3   0          01                      11                       110
4   0          10                      110                      111


the first two bits are 11, we must look at the following bits. If the next
bit is a 1, the first source symbol is a 3. If the length of the string of
0's immediately following the 11 is odd, the first codeword must be 110
and the first source symbol must be 4; if the length of the string of 0's is
even, the first source symbol is a 3. By repeating this argument, we can see
that this code is uniquely decodable. Sardinas and Patterson [455] have 
devised a finite test for unique decodability, which involves forming sets 
of possible suffixes to the codewords and eliminating them systematically. 
The test is described more fully in Problem 5.5.27. The fact that the last 
code in Table 5.1 is instantaneous is obvious since no codeword is a prefix 
of any other. 
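
The classification in Table 5.1 can be checked mechanically. The sketch below implements the suffix-set test in the form in which it is commonly stated (the text defers the details to a problem, so the exact formulation here is an assumption): repeatedly form the "dangling suffixes" left over when one string is a prefix of another, and declare the code not uniquely decodable if a dangling suffix is itself a codeword. Function names are hypothetical.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different entry of the list."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(codewords)
        for j, b in enumerate(codewords)
    )


def _dangling(prefixes, words):
    """Nonempty suffixes w[len(p):] for p in prefixes that start w in words."""
    return {
        w[len(p):]
        for p in prefixes
        for w in words
        if len(w) > len(p) and w.startswith(p)
    }


def is_uniquely_decodable(codewords):
    """Suffix-set test: the code is uniquely decodable iff no dangling
    suffix generated along the way is itself a codeword."""
    code = set(codewords)
    if len(code) != len(codewords) or "" in code:
        return False  # singular code, or an empty codeword
    current = _dangling(code, code)
    seen = set()
    while current:
        if current & code:
            return False
        seen |= current
        # New suffixes arise from a codeword starting a suffix or vice versa.
        current = (_dangling(code, current) | _dangling(current, code)) - seen
    return True


# The four codes of Table 5.1, listed for x = 1, 2, 3, 4.
TABLE_5_1 = {
    "singular": ["0", "0", "0", "0"],
    "nonsingular": ["0", "010", "01", "10"],
    "uniquely decodable": ["10", "00", "11", "110"],
    "instantaneous": ["0", "10", "110", "111"],
}

if __name__ == "__main__":
    for name, words in TABLE_5_1.items():
        print(name, is_uniquely_decodable(words), is_prefix_free(words))
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many distinct suffixes can appear.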

5.2 KRAFT INEQUALITY 

We wish to construct instantaneous codes of minimum expected length to 
describe a given source. It is clear that we cannot assign short codewords 
to all source symbols and still be prefix-free. The set of codeword lengths 
possible for instantaneous codes is limited by the following inequality. 

Theorem 5.2.1 (Kraft inequality) For any instantaneous code (prefix
code) over an alphabet of size D, the codeword lengths $l_1, l_2, \ldots, l_m$ must
satisfy the inequality

$\sum_i D^{-l_i} \le 1.$ (5.6)

Conversely, given a set of codeword lengths that satisfy this inequality,
there exists an instantaneous code with these word lengths.

Proof: Consider a D-ary tree in which each node has D children. Let the 
branches of the tree represent the symbols of the codeword. For example, 
the D branches arising from the root node represent the D possible values 
of the first symbol of the codeword. Then each codeword is represented
by a leaf on the tree. The path from the root traces out the symbols of the
codeword. A binary example of such a tree is shown in Figure 5.2. The 
prefix condition on the codewords implies that no codeword is an ancestor 
of any other codeword on the tree. Hence, each codeword eliminates its 
descendants as possible codewords. 

Let $l_{\max}$ be the length of the longest codeword of the set of codewords.
Consider all nodes of the tree at level $l_{\max}$. Some of them are codewords,
some are descendants of codewords, and some are neither. A codeword
at level $l_i$ has $D^{l_{\max} - l_i}$ descendants at level $l_{\max}$. Each of these descendant
sets must be disjoint. Also, the total number of nodes in these sets must
be less than or equal to $D^{l_{\max}}$. Hence, summing over all the codewords,
we have

$\sum D^{l_{\max} - l_i} \le D^{l_{\max}}$ (5.7)

or

$\sum D^{-l_i} \le 1,$ (5.8)

which is the Kraft inequality.

Conversely, given any set of codeword lengths $l_1, l_2, \ldots, l_m$ that
satisfy the Kraft inequality, we can always construct a tree like the one in
Figure 5.2. Label the first node (lexicographically) of depth $l_1$ as
codeword 1, and remove its descendants from the tree. Then label the first
remaining node of depth $l_2$ as codeword 2, and so on. Proceeding this
way, we construct a prefix code with the specified $l_1, l_2, \ldots, l_m$. □
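
The converse construction can be sketched in code. The version below is a simplified variant rather than a literal transcription of the proof: it processes the lengths in nondecreasing order (an assumption that keeps the bookkeeping to a single counter) and always takes the lexicographically first free node at the required depth. The function names are hypothetical, and digits are written as single characters, so the sketch assumes D ≤ 10.

```python
from fractions import Fraction


def kraft_sum(lengths, D=2):
    """Exact value of sum_i D^(-l_i)."""
    return sum(Fraction(1, D ** l) for l in lengths)


def _digits(value, D, length):
    """Write value as a D-ary string of exactly `length` digits (D <= 10)."""
    out = []
    for _ in range(length):
        value, d = divmod(value, D)
        out.append(str(d))
    return "".join(reversed(out))


def prefix_code_from_lengths(lengths, D=2):
    """Return codewords (in the input order) with the given lengths."""
    if kraft_sum(lengths, D) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    node, depth = 0, 0  # index of the first free node at the current depth
    for i in order:
        node *= D ** (lengths[i] - depth)  # descend to the new depth
        depth = lengths[i]
        code[i] = _digits(node, D, depth)
        node += 1                          # skip this codeword's subtree
    return code


if __name__ == "__main__":
    print(kraft_sum([1, 2, 3, 3]))
    print(prefix_code_from_lengths([1, 2, 3, 3]))
```

For the lengths (1, 2, 3, 3) this reproduces the code 0, 10, 110, 111 of Example 5.1.1, whose Kraft sum is exactly 1.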

We now show that an infinite prefix code also satisfies the Kraft
inequality.

Theorem 5.2.2 (Extended Kraft Inequality) For any countably
infinite set of codewords that form a prefix code, the codeword lengths satisfy
the extended Kraft inequality,

$\sum_{i=1}^{\infty} D^{-l_i} \le 1.$ (5.9)

Conversely, given any $l_1, l_2, \ldots$ satisfying the extended Kraft inequality,
we can construct a prefix code with these codeword lengths.

Proof: Let the D-ary alphabet be {0, 1, ..., D − 1}. Consider the ith
codeword $y_1 y_2 \cdots y_{l_i}$. Let $0.y_1 y_2 \cdots y_{l_i}$ be the real number given by the
D-ary expansion

$0.y_1 y_2 \cdots y_{l_i} = \sum_{j=1}^{l_i} y_j D^{-j}.$ (5.10)

This codeword corresponds to the interval

$\left[ 0.y_1 y_2 \cdots y_{l_i},\ 0.y_1 y_2 \cdots y_{l_i} + \frac{1}{D^{l_i}} \right),$ (5.11)

the set of all real numbers whose D-ary expansion begins with
$0.y_1 y_2 \cdots y_{l_i}$. This is a subinterval of the unit interval [0, 1]. By the prefix
condition, these intervals are disjoint. Hence, the sum of their lengths has
to be less than or equal to 1. This proves that

$\sum_{i=1}^{\infty} D^{-l_i} \le 1.$ (5.12)

Just as in the finite case, we can reverse the proof to construct the 
code for a given luh,... that satisfies the Kraft inequality. First, reorder 
the indexing so that l\ < /2 < .... Then simply assign the intervals in 





order from the low end of the unit interval. For example, if we wish to
construct a binary code with $l_1 = 1, l_2 = 2, \ldots$, we assign the intervals
$[0, \frac{1}{2}), [\frac{1}{2}, \frac{3}{4}), \ldots$ to the symbols, with corresponding codewords 0, 10,
.... □

In Section 5.5 we show that the lengths of codewords for a uniquely 
decodable code also satisfy the Kraft inequality. Before we do that, we 
consider the problem of finding the shortest instantaneous code. 

5.3 OPTIMAL CODES 

In Section 5.2 we proved that any codeword set that satisfies the prefix 
condition has to satisfy the Kraft inequality and that the Kraft inequality 
is a sufficient condition for the existence of a codeword set with the 
specified set of codeword lengths. We now consider the problem of finding 
the prefix code with the minimum expected length. From the results of 
Section 5.2, this is equivalent to finding the set of lengths $l_1, l_2, \ldots, l_m$
satisfying the Kraft inequality and whose expected length $L = \sum p_i l_i$ is
less than the expected length of any other prefix code. This is a standard
optimization problem: Minimize

$$L = \sum p_i l_i \tag{5.13}$$

over all integers $l_1, l_2, \ldots, l_m$ satisfying

$$\sum D^{-l_i} \le 1. \tag{5.14}$$

A simple analysis by calculus suggests the form of the minimizing $l_i^*$.
We neglect the integer constraint on $l_i$ and assume equality in the
constraint. Hence, we can write the constrained minimization using Lagrange
multipliers as the minimization of

$$J = \sum p_i l_i + \lambda \left( \sum D^{-l_i} \right). \tag{5.15}$$

Differentiating with respect to $l_i$, we obtain

$$\frac{\partial J}{\partial l_i} = p_i - \lambda D^{-l_i} \log_e D. \tag{5.16}$$

Setting the derivative to 0, we obtain

$$D^{-l_i} = \frac{p_i}{\lambda \log_e D}. \tag{5.17}$$

Substituting this in the constraint to find $\lambda$, we find $\lambda = 1/\log_e D$, and
hence

$$p_i = D^{-l_i}, \tag{5.18}$$

yielding optimal code lengths,

$$l_i^* = -\log_D p_i. \tag{5.19}$$

This noninteger choice of codeword lengths yields expected codeword
length

$$L^* = \sum p_i l_i^* = -\sum p_i \log_D p_i = H_D(X). \tag{5.20}$$

But since the $l_i$ must be integers, we will not always be able to set
the codeword lengths as in (5.19). Instead, we should choose a set of
codeword lengths $l_i$ “close” to the optimal set. Rather than demonstrate
by calculus that $l_i^* = -\log_D p_i$ is a global minimum, we verify optimality
directly in the proof of the following theorem.

Theorem 5.3.1 The expected length L of any instantaneous D-ary code
for a random variable X is greater than or equal to the entropy $H_D(X)$;
that is,

$$L \ge H_D(X), \tag{5.21}$$

with equality if and only if $D^{-l_i} = p_i$.

Proof: We can write the difference between the expected length and the
entropy as

$$L - H_D(X) = \sum p_i l_i - \sum p_i \log_D \frac{1}{p_i} \tag{5.22}$$

$$= -\sum p_i \log_D D^{-l_i} + \sum p_i \log_D p_i. \tag{5.23}$$

Letting $r_i = D^{-l_i} / \sum_j D^{-l_j}$ and $c = \sum D^{-l_i}$, we obtain

$$L - H = \sum p_i \log_D \frac{p_i}{r_i} - \log_D c \tag{5.24}$$

$$= D(\mathbf{p} \| \mathbf{r}) + \log_D \frac{1}{c} \tag{5.25}$$

$$\ge 0 \tag{5.26}$$

by the nonnegativity of relative entropy and the fact (Kraft inequality)
that $c \le 1$. Hence, $L \ge H$ with equality if and only if $p_i = D^{-l_i}$ (i.e., if
and only if $-\log_D p_i$ is an integer for all $i$). □
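
The decomposition in (5.24) and (5.25) can be checked numerically. The following Python sketch is only an illustration; the function name `decomposition` and the test distributions are our own assumptions. It computes L, H, D(p||r), and log_D(1/c) for a given set of lengths so that the identity L - H = D(p||r) + log_D(1/c) can be confirmed term by term.

```python
import math

def decomposition(p, lengths, D=2):
    """Return (L, H, D(p||r), log_D(1/c)) for probabilities p and codeword lengths."""
    c = sum(D ** -l for l in lengths)                 # Kraft sum
    r = [D ** -l / c for l in lengths]                # normalized D**(-l_i)
    L = sum(pi * l for pi, l in zip(p, lengths))      # expected length
    H = -sum(pi * math.log(pi, D) for pi in p if pi > 0)
    kl = sum(pi * math.log(pi / ri, D) for pi, ri in zip(p, r) if pi > 0)
    return L, H, kl, math.log(1 / c, D)

L, H, kl, slack = decomposition([0.25, 0.25, 0.2, 0.15, 0.15], [2, 2, 2, 3, 3])
print(L - H, kl + slack)   # the two numbers agree up to rounding
```

For a D-adic distribution with $l_i = -\log_D p_i$, both the relative entropy term and the slack term vanish, so L = H.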

Definition A probability distribution is called D-adic if each of the
probabilities is equal to $D^{-n}$ for some n. Thus, we have equality in the
theorem if and only if the distribution of X is D-adic.

The preceding proof also indicates a procedure for finding an optimal 
code: Find the D-adic distribution that is closest (in the relative entropy 
sense) to the distribution of X. This distribution provides the set of code¬ 
word lengths. Construct the code by choosing the first available node as 
in the proof of the Kraft inequality. We then have an optimal code for X. 

However, this procedure is not easy, since the search for the closest 
D-adic distribution is not obvious. In the next section we give a good 
suboptimal procedure (Shannon-Fano coding). In Section 5.6 we describe 
a simple procedure (Huffman coding) for actually finding the optimal 
code. 


5.4 BOUNDS ON THE OPTIMAL CODE LENGTH 

We now demonstrate a code that achieves an expected description length
L within 1 bit of the lower bound; that is,

$$H(X) \le L < H(X) + 1. \tag{5.27}$$

Recall the setup of Section 5.3: We wish to minimize $L = \sum p_i l_i$
subject to the constraint that $l_1, l_2, \ldots, l_m$ are integers and $\sum D^{-l_i} \le 1$. We
proved that the optimal codeword lengths can be found by finding the
D-adic probability distribution closest to the distribution of X in relative
entropy, that is, by finding the D-adic $\mathbf{r}$ ($r_i = D^{-l_i} / \sum_j D^{-l_j}$) minimizing

$$L - H_D = D(\mathbf{p} \| \mathbf{r}) - \log \left( \sum D^{-l_i} \right) \ge 0. \tag{5.28}$$

The choice of word lengths $l_i = \log_D \frac{1}{p_i}$ yields L = H. Since $\log_D \frac{1}{p_i}$
may not equal an integer, we round it up to give integer word-length
assignments,

$$l_i = \left\lceil \log_D \frac{1}{p_i} \right\rceil, \tag{5.29}$$

where $\lceil x \rceil$ is the smallest integer $\ge x$. These lengths satisfy the Kraft
inequality since

$$\sum D^{-\lceil \log_D \frac{1}{p_i} \rceil} \le \sum D^{-\log_D \frac{1}{p_i}} = \sum p_i = 1. \tag{5.30}$$

This choice of codeword lengths satisfies

$$\log_D \frac{1}{p_i} \le l_i < \log_D \frac{1}{p_i} + 1. \tag{5.31}$$

Multiplying by $p_i$ and summing over $i$, we obtain

$$H_D(X) \le L < H_D(X) + 1. \tag{5.32}$$


Since an optimal code can only be better than this code, we have the
following theorem.

Theorem 5.4.1 Let $l_1^*, l_2^*, \ldots, l_m^*$ be optimal codeword lengths for a
source distribution $\mathbf{p}$ and a D-ary alphabet, and let $L^*$ be the associated
expected length of an optimal code ($L^* = \sum p_i l_i^*$). Then

$$H_D(X) \le L^* < H_D(X) + 1. \tag{5.33}$$

Proof: Let $l_i = \lceil \log_D \frac{1}{p_i} \rceil$. Then $l_i$ satisfies the Kraft inequality and from
(5.32) we have

$$H_D(X) \le L = \sum p_i l_i < H_D(X) + 1. \tag{5.34}$$

But since $L^*$, the expected length of the optimal code, is no greater than
$L = \sum p_i l_i$, and since $L^* \ge H_D$ from Theorem 5.3.1, we have the
theorem. □
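
As a concrete check of (5.29) through (5.32), the Python sketch below computes the rounded-up lengths and compares the expected length with the entropy. It is an illustration under our own assumptions: the helper names `shannon_lengths` and `entropy` are hypothetical, and the ceiling is computed exactly as the smallest l with $D^l p_i \ge 1$ so that floating-point rounding cannot affect the result when the probabilities are given as fractions.

```python
import math
from fractions import Fraction

def shannon_lengths(p, D=2):
    """l_i = ceil(log_D(1/p_i)), found as the smallest l with D**l * p_i >= 1."""
    lengths = []
    for pi in p:
        l = 0
        while D ** l * pi < 1:
            l += 1
        lengths.append(l)
    return lengths

def entropy(p, D=2):
    """Entropy of p in D-ary units."""
    return -sum(float(pi) * math.log(float(pi), D) for pi in p if pi > 0)

p = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 5), Fraction(3, 20), Fraction(3, 20)]
lengths = shannon_lengths(p)                       # binary case
L = sum(pi * l for pi, l in zip(p, lengths))       # expected length
kraft = sum(Fraction(1, 2 ** l) for l in lengths)  # must not exceed 1
print(lengths, float(L), entropy(p), kraft)
```

For this distribution the lengths are (2, 2, 3, 3, 3), the Kraft sum is 7/8, and the expected length 2.5 lies between H and H + 1.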


In Theorem 5.4.1 there is an overhead that is at most 1 bit, due to the
fact that $\log \frac{1}{p_i}$ is not always an integer. We can reduce the overhead per
symbol by spreading it out over many symbols. With this in mind, let us
consider a system in which we send a sequence of n symbols from X.
The symbols are assumed to be drawn i.i.d. according to p(x). We can
consider these n symbols to be a supersymbol from the alphabet $\mathcal{X}^n$.

Define $L_n$ to be the expected codeword length per input symbol, that
is, if $l(x_1, x_2, \ldots, x_n)$ is the length of the binary codeword associated
with $(x_1, x_2, \ldots, x_n)$ (for the rest of this section, we assume that D = 2,
for simplicity), then

$$L_n = \frac{1}{n} \sum p(x_1, x_2, \ldots, x_n)\, l(x_1, x_2, \ldots, x_n) = \frac{1}{n} E\, l(X_1, X_2, \ldots, X_n). \tag{5.35}$$

We can now apply the bounds derived above to the code:

$$H(X_1, X_2, \ldots, X_n) \le E\, l(X_1, X_2, \ldots, X_n) < H(X_1, X_2, \ldots, X_n) + 1. \tag{5.36}$$

Since $X_1, X_2, \ldots, X_n$ are i.i.d., $H(X_1, X_2, \ldots, X_n) = \sum H(X_i) =
nH(X)$. Dividing (5.36) by n, we obtain

$$H(X) \le L_n < H(X) + \frac{1}{n}. \tag{5.37}$$

Hence, by using large block lengths we can achieve an expected code
length per symbol arbitrarily close to the entropy.
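
The effect of block length in (5.37) can be illustrated numerically. The sketch below assumes a binary i.i.d. source that emits 1 with probability 0.1 (a hypothetical choice) and uses the rounded-up lengths $\lceil \log \frac{1}{p(x^n)} \rceil$ for each block; the function names are our own. Blocks with the same number of ones have the same probability, so the sum over all $2^n$ blocks reduces to a sum over the number of ones.

```python
import math

def block_shannon_rate(p1, n):
    """Expected length per symbol of the rounded-up code on blocks of n
    i.i.d. bits, where each bit equals 1 with probability p1."""
    total = 0.0
    for k in range(n + 1):                        # k = number of ones in the block
        prob = p1 ** k * (1 - p1) ** (n - k)      # probability of one such block
        length = math.ceil(-math.log2(prob))      # codeword length for that block
        total += math.comb(n, k) * prob * length
    return total / n

def binary_entropy(p1):
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

H = binary_entropy(0.1)
for n in (1, 2, 4, 8, 16):
    print(n, block_shannon_rate(0.1, n), H + 1 / n)
```

Each printed rate lies between H(X) and H(X) + 1/n, although the rate need not decrease monotonically in n.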

We can use the same argument for a sequence of symbols from a
stochastic process that is not necessarily i.i.d. In this case, we still have
the bound

$$H(X_1, X_2, \ldots, X_n) \le E\, l(X_1, X_2, \ldots, X_n) < H(X_1, X_2, \ldots, X_n) + 1. \tag{5.38}$$

Dividing by n again and defining $L_n$ to be the expected description length
per symbol, we obtain

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}. \tag{5.39}$$

If the stochastic process is stationary, then $H(X_1, X_2, \ldots, X_n)/n \to H(\mathcal{X})$,
and the expected description length tends to the entropy rate as
$n \to \infty$. Thus, we have the following theorem:

Theorem 5.4.2 The minimum expected codeword length per symbol
satisfies

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n^* < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}. \tag{5.40}$$

Moreover, if $X_1, X_2, \ldots, X_n$ is a stationary stochastic process,

$$L_n^* \to H(\mathcal{X}), \tag{5.41}$$

where $H(\mathcal{X})$ is the entropy rate of the process.

This theorem provides another justification for the definition of entropy
rate: it is the expected number of bits per symbol required to describe
the process.

Finally, we ask what happens to the expected description length if the
code is designed for the wrong distribution. For example, the wrong
distribution may be the best estimate that we can make of the unknown true
distribution. We consider the Shannon code assignment $l(x) = \lceil \log \frac{1}{q(x)} \rceil$
designed for the probability mass function q(x). Suppose that the true
probability mass function is p(x). Thus, we will not achieve expected
length $L \approx H(p) = -\sum p(x) \log p(x)$. We now show that the increase
in expected description length due to the incorrect distribution is the
relative entropy $D(p \| q)$. Thus, $D(p \| q)$ has a concrete interpretation as the
increase in descriptive complexity due to incorrect information.

Theorem 5.4.3 (Wrong code) The expected length under p(x) of the
code assignment $l(x) = \lceil \log \frac{1}{q(x)} \rceil$ satisfies

$$H(p) + D(p \| q) \le E_p\, l(X) < H(p) + D(p \| q) + 1. \tag{5.42}$$

Proof: The expected codelength is

$$E\, l(X) = \sum_x p(x) \left\lceil \log \frac{1}{q(x)} \right\rceil \tag{5.43}$$

$$< \sum_x p(x) \left( \log \frac{1}{q(x)} + 1 \right) \tag{5.44}$$

$$= \sum_x p(x) \log \left( \frac{p(x)}{q(x)} \frac{1}{p(x)} \right) + 1 \tag{5.45}$$

$$= \sum_x p(x) \log \frac{p(x)}{q(x)} + \sum_x p(x) \log \frac{1}{p(x)} + 1 \tag{5.46}$$

$$= D(p \| q) + H(p) + 1. \tag{5.47}$$

The lower bound can be derived similarly. □

Thus, believing that the distribution is q(x) when the true distribution
is p(x) incurs a penalty of $D(p \| q)$ in the average description length.
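
A small numerical illustration of Theorem 5.4.3 follows. The distributions p and q are hypothetical choices of ours, as is the function name `wrong_code_stats`; the sketch only evaluates the three quantities in (5.42) for the binary case.

```python
import math

def wrong_code_stats(p, q):
    """Return (E_p l(X), H(p), D(p||q)) in bits for lengths l(x) = ceil(log2(1/q(x)))."""
    lengths = [math.ceil(-math.log2(qx)) for qx in q]
    expected_len = sum(px * l for px, l in zip(p, lengths))
    entropy = -sum(px * math.log2(px) for px in p if px > 0)
    kl = sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)
    return expected_len, entropy, kl

# True distribution p, but the code is designed for the uniform q.
El, H, Dpq = wrong_code_stats([0.5, 0.25, 0.125, 0.125], [0.25, 0.25, 0.25, 0.25])
print(El, H, Dpq)
```

Here every codeword has length 2, so the expected length is 2 bits, while H(p) = 1.75 bits and D(p||q) = 0.25 bits. The lower bound in (5.42) holds with equality because q is dyadic.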


5.5 KRAFT INEQUALITY FOR UNIQUELY DECODABLE CODES 

We have proved that any instantaneous code must satisfy the Kraft
inequality. The class of uniquely decodable codes is larger than the class of
instantaneous codes, so one expects to achieve a lower expected codeword
length if L is minimized over all uniquely decodable codes. In this section
we prove that the class of uniquely decodable codes does not offer any
further possibilities for the set of codeword lengths than do instantaneous
codes. We now give Karush’s elegant proof of the following theorem.

Theorem 5.5.1 (McMillan) The codeword lengths of any uniquely
decodable D-ary code must satisfy the Kraft inequality

$$\sum D^{-l_i} \le 1. \tag{5.48}$$

Conversely, given a set of codeword lengths that satisfy this inequality, it
is possible to construct a uniquely decodable code with these codeword
lengths.

Proof: Consider $C^k$, the $k$th extension of the code (i.e., the code formed
by the concatenation of k repetitions of the given uniquely decodable code
C). By the definition of unique decodability, the $k$th extension of the code
is nonsingular. Since there are only $D^n$ different D-ary strings of length n,
unique decodability implies that the number of code sequences of length
n in the $k$th extension of the code must be no greater than $D^n$. We now
use this observation to prove the Kraft inequality.

Let the codeword lengths of the symbols $x \in \mathcal{X}$ be denoted by l(x).
For the extension code, the length of the code sequence is

$$l(x_1, x_2, \ldots, x_k) = \sum_{i=1}^{k} l(x_i). \tag{5.49}$$

The inequality that we wish to prove is

$$\sum_{x \in \mathcal{X}} D^{-l(x)} \le 1. \tag{5.50}$$

The trick is to consider the $k$th power of this quantity. Thus,

$$\left( \sum_{x \in \mathcal{X}} D^{-l(x)} \right)^k = \sum_{x_1 \in \mathcal{X}} \sum_{x_2 \in \mathcal{X}} \cdots \sum_{x_k \in \mathcal{X}} D^{-l(x_1)} D^{-l(x_2)} \cdots D^{-l(x_k)} \tag{5.51}$$

$$= \sum_{x_1, x_2, \ldots, x_k \in \mathcal{X}^k} D^{-l(x_1)} D^{-l(x_2)} \cdots D^{-l(x_k)} \tag{5.52}$$

$$= \sum_{x^k \in \mathcal{X}^k} D^{-l(x^k)}, \tag{5.53}$$

by (5.49). We now gather the terms by word lengths to obtain

$$\sum_{x^k \in \mathcal{X}^k} D^{-l(x^k)} = \sum_{m=1}^{k l_{\max}} a(m) D^{-m}, \tag{5.54}$$

where $l_{\max}$ is the maximum codeword length and a(m) is the number
of source sequences $x^k$ mapping into codewords of length m. But the
code is uniquely decodable, so there is at most one sequence mapping
into each code m-sequence and there are at most $D^m$ code m-sequences.
Thus, $a(m) \le D^m$, and we have

$$\left( \sum_{x \in \mathcal{X}} D^{-l(x)} \right)^k = \sum_{m=1}^{k l_{\max}} a(m) D^{-m} \tag{5.55}$$

$$\le \sum_{m=1}^{k l_{\max}} D^m D^{-m} \tag{5.56}$$

$$= k l_{\max} \tag{5.57}$$

and hence

$$\sum_j D^{-l_j} \le (k l_{\max})^{1/k}. \tag{5.58}$$

Since this inequality is true for all k, it is true in the limit as $k \to \infty$.
Since $(k l_{\max})^{1/k} \to 1$, we have

$$\sum_j D^{-l_j} \le 1, \tag{5.59}$$

which is the Kraft inequality.

Conversely, given any set of $l_1, l_2, \ldots, l_m$ satisfying the Kraft
inequality, we can construct an instantaneous code as proved in Section 5.2. Since
every instantaneous code is uniquely decodable, we have also constructed
a uniquely decodable code. □
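
The counting argument behind (5.54) through (5.57) can be explored numerically. The Python sketch below is illustrative, with function names of our own choosing. It computes a(m), the number of k-tuples of codewords with total length m, from the codeword lengths alone, and tests whether any a(m) exceeds $D^m$. If it does, two different source sequences must share a code sequence, so no code with those lengths can be uniquely decodable.

```python
def sequence_counts(lengths, k):
    """a[m] = number of k-tuples of codewords whose total length is m."""
    a = {0: 1}
    for _ in range(k):
        nxt = {}
        for m, cnt in a.items():
            for l in lengths:
                nxt[m + l] = nxt.get(m + l, 0) + cnt
        a = nxt
    return a

def violates_counting_bound(lengths, k, D=2):
    """True if some a(m) exceeds D**m in the kth extension."""
    return any(cnt > D ** m for m, cnt in sequence_counts(lengths, k).items())

print(violates_counting_bound([1, 2, 3, 3], 6))   # Kraft sum 1: bound never violated
print(violates_counting_bound([1, 1, 2], 5))      # Kraft sum 5/4: violated at k = 5
```

For the lengths (1, 1, 2) the bound still holds for k up to 4, which shows why the proof needs the kth power for large k rather than a single fixed extension.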

Corollary A uniquely decodable code for an infinite source alphabet $\mathcal{X}$
also satisfies the Kraft inequality.

Proof: The point at which the preceding proof breaks down for infinite
$|\mathcal{X}|$ is at (5.58), since for an infinite code $l_{\max}$ is infinite. But there is a
simple fix to the proof. Any subset of a uniquely decodable code is also
uniquely decodable; thus, any finite subset of the infinite set of codewords
satisfies the Kraft inequality. Hence,

$$\sum_{i=1}^{\infty} D^{-l_i} = \lim_{N \to \infty} \sum_{i=1}^{N} D^{-l_i} \le 1. \tag{5.60}$$

Given a set of word lengths $l_1, l_2, \ldots$ that satisfy the Kraft inequality, we
can construct an instantaneous code as in Section 5.2. Since instantaneous
codes are uniquely decodable, we have constructed a uniquely decodable
code with an infinite number of codewords. So the McMillan theorem
also applies to infinite alphabets. □

The theorem implies a rather surprising result — that the class of 
uniquely decodable codes does not offer any further choices for the set 
of codeword lengths than the class of prefix codes. The set of achievable 
codeword lengths is the same for uniquely decodable and instantaneous 
codes. Hence, the bounds derived on the optimal codeword lengths con¬ 
tinue to hold even when we expand the class of allowed codes to the class 
of all uniquely decodable codes. 

5.6 HUFFMAN CODES 

An optimal (shortest expected length) prefix code for a given distribution 
can be constructed by a simple algorithm discovered by Huffman [283]. 
We will prove that any other code for the same alphabet cannot have a 
lower expected length than the code constructed by the algorithm. Before 
we give any formal proofs, let us introduce Huffman codes with some 
examples. 

Example 5.6.1 Consider a random variable X taking values in the set
$\mathcal{X} = \{1, 2, 3, 4, 5\}$ with probabilities 0.25, 0.25, 0.2, 0.15, 0.15,
respectively. We expect the optimal binary code for X to have the longest
codewords assigned to the symbols 4 and 5. These two lengths must be
equal, since otherwise we can delete a bit from the longer codeword and
still have a prefix code, but with a shorter expected length. In general,
we can construct a code in which the two longest codewords differ only
in the last bit. For this code, we can combine the symbols 4 and 5 into
a single source symbol, with a probability assignment 0.30. Proceeding
this way, combining the two least likely symbols into one symbol until
we are finally left with only one symbol, and then assigning codewords
to the symbols, we obtain the following table, in which each successive
probability column lists the sorted probabilities after one more merge:

    Codeword
    Length     Codeword   X   Probability
      2          01       1   0.25   0.3    0.45   0.55   1
      2          10       2   0.25   0.25   0.3    0.45
      2          11       3   0.2    0.25   0.25
      3          000      4   0.15   0.2
      3          001      5   0.15

This code has average length 2.3 bits.

Example 5.6.2 Consider a ternary code for the same random variable.
Now we combine the three least likely symbols into one supersymbol and
obtain the following table:

    Codeword   X   Probability
      1        1   0.25   0.5    1
      2        2   0.25   0.25
      00       3   0.2    0.25
      01       4   0.15
      02       5   0.15

This code has an average length of 1.5 ternary digits.

Example 5.6.3 If $D \ge 3$, we may not have a sufficient number of
symbols so that we can combine them D at a time. In such a case, we add
dummy symbols to the end of the set of symbols. The dummy symbols
have probability 0 and are inserted to fill the tree. Since at each stage of
the reduction, the number of symbols is reduced by $D - 1$, we want the
total number of symbols to be $1 + k(D - 1)$, where k is the number of
merges. Hence, we add enough dummy symbols so that the total number
of symbols is of this form. For example:

    Codeword   X       Probability
      1        1       0.25   0.25   0.5    1.0
      2        2       0.25   0.25   0.25
      01       3       0.2    0.2    0.25
      02       4       0.1    0.2
      000      5       0.1    0.1
      001      6       0.1
      002      Dummy   0.0

This code has an average length of 1.7 ternary digits.
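
The merging procedure used in these examples can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are our own, only the codeword lengths are computed (the codewords themselves follow from the lengths by the construction of Section 5.2), and ties between equal probabilities are broken by insertion order, which is one of several valid choices.

```python
import heapq
from itertools import count

def huffman_lengths(probs, D=2):
    """Codeword lengths of a D-ary Huffman code for the given probabilities."""
    n = len(probs)
    dummies = (1 - n) % (D - 1)        # pad so that n + dummies = 1 + k(D - 1)
    tie = count()                      # tie-breaker so the heap never compares lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heap += [(0.0, next(tie), []) for _ in range(dummies)]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        total, members = 0.0, []
        for _step in range(D):         # merge the D least likely nodes
            p, _order, group = heapq.heappop(heap)
            total += p
            members += group
        for i in members:              # every symbol under the new node moves one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (total, next(tie), members))
    return lengths

def expected_length(probs, lengths):
    return sum(p * l for p, l in zip(probs, lengths))

p = [0.25, 0.25, 0.2, 0.15, 0.15]
print(huffman_lengths(p), expected_length(p, huffman_lengths(p)))
print(huffman_lengths(p, D=3), expected_length(p, huffman_lengths(p, D=3)))
```

Applied to the distributions of Examples 5.6.1 through 5.6.3, the sketch gives expected lengths of 2.3 bits, 1.5 ternary digits, and 1.7 ternary digits (up to floating-point rounding).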
A proof of the optimality of Huffman coding is given in Section 5.8.


5.7 SOME COMMENTS ON HUFFMAN CODES 

1. Equivalence of source coding and 20 questions. We now digress 
to show the equivalence of coding and the game “20 questions”. 
Suppose that we wish to find the most efficient series of yes-no 
questions to determine an object from a class of objects. Assuming 
that we know the probability distribution on the objects, can we find 
the most efficient sequence of questions? (To determine an object, 
we need to ensure that the responses to the sequence of questions 
uniquely identifies the object from the set of possible objects; it is 
not necessary that the last question have a “yes” answer.) 

We first show that a sequence of questions is equivalent to a code 
for the object. Any question depends only on the answers to the 
questions before it. Since the sequence of answers uniquely deter¬ 
mines the object, each object has a different sequence of answers, 
and if we represent the yes-no answers by 0’s and 1’s, we have a
binary code for the set of objects. The average length of this code 
is the average number of questions for the questioning scheme. 

Also, from a binary code for the set of objects, we can find a 
sequence of questions that correspond to the code, with the average 
number of questions equal to the expected codeword length of the 
code. The first question in this scheme becomes: Is the first bit equal 
to 1 in the object’s codeword? 

Since the Huffman code is the best source code for a random 
variable, the optimal series of questions is that determined by the 
Huffman code. In Example 5.6.1 the optimal first question is: Is
X equal to 2 or 3? The answer to this determines the first bit of
the Huffman code. Assuming that the answer to the first question
is “yes,” the next question should be “Is X equal to 3?”, which
determines the second bit. However, we need not wait for the answer
to the first question to ask the second. We can ask as our second
question “Is X equal to 1 or 3?”, determining the second bit of the
Huffman code independent of the first.

The expected number of questions EQ in this optimal scheme
satisfies

$$H(X) \le EQ < H(X) + 1. \tag{5.61}$$
2. Huffman coding for weighted codewords. Huffman’s algorithm for
minimizing $\sum p_i l_i$ can be applied to any set of numbers $p_i \ge 0$,
regardless of $\sum p_i$. In this case, the Huffman code minimizes the
sum of weighted code lengths $\sum w_i l_i$ rather than the average code
length.

Example 5.7.1 We perform the weighted minimization using the 
same algorithm. 



In this case the code minimizes the weighted sum of the codeword 
lengths, and the minimum weighted sum is 36. 

3. Huffman coding and “slice” questions (Alphabetic codes). We have
described the equivalence of source coding with the game of 20
questions. The optimal sequence of questions corresponds to an
optimal source code for the random variable. However, Huffman
codes ask arbitrary questions of the form “Is $X \in A$?” for any set
$A \subseteq \{1, 2, \ldots, m\}$.

Now we consider the game “20 questions” with a restricted set
of questions. Specifically, we assume that the elements of $\mathcal{X} =
\{1, 2, \ldots, m\}$ are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$ and that the
only questions allowed are of the form “Is $X > a$?” for some a. The
Huffman code constructed by the Huffman algorithm may not
correspond to slices (sets of the form $\{x : x < a\}$). If we take the
codeword lengths ($l_1 \le l_2 \le \cdots \le l_m$, by Lemma 5.8.1) derived from the
Huffman code and use them to assign the symbols to the code tree
by taking the first available node at the corresponding level, we
will construct another optimal code. However, unlike the Huffman
code itself, this code is a slice code, since each question (each bit
of the code) splits the tree into sets of the form $\{x : x > a\}$ and
$\{x : x < a\}$.

We illustrate this with an example. 


Example 5.7.2 Consider the first example of Section 5.6. The
code that was constructed by the Huffman coding procedure is not a
slice code. But using the codeword lengths from the Huffman
procedure, namely, {2, 2, 2, 3, 3}, and assigning the symbols to the first
available node on the tree, we obtain the following code for this
random variable:

1 → 00, 2 → 01, 3 → 10, 4 → 110, 5 → 111

It can be verified that this code is a slice code. Slice codes are also known
as alphabetic codes because the codewords are ordered alphabetically.

4. Huffman codes and Shannon codes. Using codeword lengths of
$\lceil \log \frac{1}{p_i} \rceil$ (which is called Shannon coding) may be much worse than
the optimal code for some particular symbol. For example,
consider two symbols, one of which occurs with probability 0.9999 and
the other with probability 0.0001. Then using codeword lengths of
$\lceil \log \frac{1}{p_i} \rceil$ gives codeword lengths of 1 bit and 14 bits, respectively.
The optimal codeword length is obviously 1 bit for both symbols.
Hence, the codeword for the infrequent symbol is much longer in
the Shannon code than in the optimal code.

Is it true that the codeword lengths for an optimal code are always
less than $\lceil \log \frac{1}{p_i} \rceil$? The following example illustrates that this is not
always true.

Example 5.7.3 Consider a random variable X with a distribution
$(\frac{1}{3}, \frac{1}{3}, \frac{1}{4}, \frac{1}{12})$. The Huffman coding procedure results in codeword
lengths of (2, 2, 2, 2) or (1, 2, 3, 3) [depending on where one puts
the merged probabilities, as the reader can verify (Problem 5.5.12)].
Both these codes achieve the same expected codeword length. In the
second code, the third symbol has length 3, which is greater than
$\lceil \log \frac{1}{p_3} \rceil$. Thus, the codeword length for a Shannon code could be
less than the codeword length of the corresponding symbol of an
optimal (Huffman) code. This example also illustrates the fact that
the set of codeword lengths for an optimal code is not unique (there
may be more than one set of lengths with the same expected value).

Although either the Shannon code or the Huffman code can be
shorter for individual symbols, the Huffman code is shorter on
average. Also, the Shannon code and the Huffman code differ by less
than 1 bit in expected codelength (since both lie between H and
H + 1).
5. Fano codes. Fano proposed a suboptimal procedure for constructing
a source code, which is similar to the idea of slice codes. In his
method we first order the probabilities in decreasing order. Then we
choose k such that $\left| \sum_{i=1}^{k} p_i - \sum_{i=k+1}^{m} p_i \right|$ is minimized. This point
divides the source symbols into two sets of almost equal probability.
Assign 0 for the first bit of the upper set and 1 for the lower set.
Repeat this process for each subset. By this recursive procedure, we
obtain a code for each source symbol. This scheme, although not
optimal in general, achieves $L(C) \le H(X) + 2$. (See [282].)
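
Fano's recursive splitting procedure described in comment 5 can be sketched in Python as follows. The sketch is illustrative only: the function name `fano_code` is our own, the probabilities are assumed to be already sorted in decreasing order, and when several split points are equally balanced the first one is taken (an arbitrary tie-breaking assumption).

```python
def fano_code(probs):
    """Binary Fano code for probabilities sorted in decreasing order."""
    codes = [""] * len(probs)

    def split(lo, hi):                       # handle symbols lo, ..., hi - 1
        if hi - lo <= 1:
            return
        total = sum(probs[lo:hi])
        best_k, best_gap, upper = lo + 1, None, 0.0
        for k in range(lo + 1, hi):          # upper set is lo..k-1, lower set is k..hi-1
            upper += probs[k - 1]
            gap = abs(upper - (total - upper))
            if best_gap is None or gap < best_gap:
                best_k, best_gap = k, gap
        for i in range(lo, best_k):
            codes[i] += "0"                  # first bit 0 for the upper set
        for i in range(best_k, hi):
            codes[i] += "1"                  # first bit 1 for the lower set
        split(lo, best_k)
        split(best_k, hi)

    split(0, len(probs))
    return codes

print(fano_code([0.25, 0.25, 0.2, 0.15, 0.15]))
```

For the distribution of Example 5.6.1 the sketch produces the codewords 00, 01, 10, 110, 111, which in this particular case has the same lengths as the Huffman code; in general the Fano code need not be optimal.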


5.8 OPTIMALITY OF HUFFMAN CODES 

We prove by induction that the binary Huffman code is optimal. It is 
important to remember that there are many optimal codes: inverting all 
the bits or exchanging two codewords of the same length will give another 
optimal code. The Huffman procedure constructs one such optimal code. 
To prove the optimality of Huffman codes, we first prove some properties 
of a particular optimal code. 

Without loss of generality, we will assume that the probability masses
are ordered, so that $p_1 \ge p_2 \ge \cdots \ge p_m$. Recall that a code is optimal if
$\sum p_i l_i$ is minimal.

Lemma 5.8.1 For any distribution，there exists an optimal instantaneous 
code (with minimum expected length) that satisfies the following proper¬ 
ties: 

1. The lengths are ordered inversely with the probabilities (i.e., if $p_j >
p_k$, then $l_j \le l_k$).

2. The two longest codewords have the same length. 

3. Two of the longest codewords differ only in the last bit and corre¬ 
spond to the two least likely symbols. 

FIGURE 5.3. Properties of optimal codes. We assume that $p_1 \ge p_2 \ge \cdots \ge p_m$. A possible
instantaneous code is given in (a). By trimming branches without siblings, we improve the
code to (b). We now rearrange the tree as shown in (c), so that the word lengths are ordered
by increasing length from top to bottom. Finally, we swap probability assignments to improve
the expected depth of the tree, as shown in (d). Every optimal code can be rearranged and
swapped into canonical form as in (d), where $l_1 \le l_2 \le \cdots \le l_m$ and $l_{m-1} = l_m$, and the last
two codewords differ only in the last bit.

Proof: The proof amounts to swapping, trimming, and rearranging, as
shown in Figure 5.3. Consider an optimal code $C_m$:

• If $p_j > p_k$, then $l_j \le l_k$. Here we swap codewords.

Consider $C'_m$, with the codewords j and k of $C_m$ interchanged. Then

$$L(C'_m) - L(C_m) = \sum p_i l'_i - \sum p_i l_i \tag{5.62}$$

$$= p_j l_k + p_k l_j - p_j l_j - p_k l_k \tag{5.63}$$

$$= (p_j - p_k)(l_k - l_j). \tag{5.64}$$

But $p_j - p_k > 0$, and since $C_m$ is optimal, $L(C'_m) - L(C_m) \ge 0$.
Hence, we must have $l_k \ge l_j$. Thus, $C_m$ itself satisfies property 1.

• The two longest codewords are of the same length. Here we trim the 
codewords. If the two longest codewords are not of the same length, 
one can delete the last bit of the longer one, preserving the prefix 
property and achieving lower expected codeword length. Hence, the 
two longest codewords must have the same length. By property 1, the 
longest codewords must belong to the least probable source symbols. 

• The two longest codewords differ only in the last bit and correspond 
to the two least likely symbols. Not all optimal codes satisfy this 
property, but by rearranging, we can find an optimal code that does. 
If there is a maximal-length codeword without a sibling, we can delete 
the last bit of the codeword and still satisfy the prefix property. This 
reduces the average codeword length and contradicts the optimality 
of the code. Hence, every maximal-length codeword in any optimal
code has a sibling. Now we can exchange the longest codewords so
that the two lowest-probability source symbols are associated with
two siblings on the tree. This does not change the expected length,
$\sum p_i l_i$. Thus, the codewords for the two lowest-probability source
symbols have maximal length and agree in all but the last bit.

Summarizing, we have shown that if $p_1 \ge p_2 \ge \cdots \ge p_m$, there exists
an optimal code with $l_1 \le l_2 \le \cdots \le l_{m-1} = l_m$, and codewords $C(x_{m-1})$
and $C(x_m)$ that differ only in the last bit. □

Thus, we have shown that there exists an optimal code satisfying
the properties of the lemma. We call such codes canonical codes.
For any probability mass function for an alphabet of size m, $\mathbf{p} =
(p_1, p_2, \ldots, p_m)$ with $p_1 \ge p_2 \ge \cdots \ge p_m$, we define the Huffman
reduction $\mathbf{p}' = (p_1, p_2, \ldots, p_{m-2}, p_{m-1} + p_m)$ over an alphabet of size $m - 1$
(Figure 5.4). Let $C^*_{m-1}(\mathbf{p}')$ be an optimal code for $\mathbf{p}'$, and let $C^*_m(\mathbf{p})$ be
the canonical optimal code for $\mathbf{p}$.

The proof of optimality will follow from two constructions: First, we
expand an optimal code for $\mathbf{p}'$ to construct a code for $\mathbf{p}$, and then we
condense an optimal canonical code for $\mathbf{p}$ to construct a code for the
Huffman reduction $\mathbf{p}'$. Comparing the average codeword lengths for the
two codes establishes that the optimal code for $\mathbf{p}$ can be obtained by
extending the optimal code for $\mathbf{p}'$.

FIGURE 5.4. Induction step for Huffman coding. Let $p_1 \ge p_2 \ge \cdots \ge p_5$. A canonical
optimal code is illustrated in (a). Combining the two lowest probabilities, we obtain the
code in (b). Rearranging the probabilities in decreasing order, we obtain the canonical code
in (c) for $m - 1$ symbols.

From the optimal code for $\mathbf{p}'$, we construct an extension code for m
elements as follows: Take the codeword in $C^*_{m-1}$ corresponding to weight
$p_{m-1} + p_m$ and extend it by adding a 0 to form a codeword for symbol
$m - 1$ and by adding 1 to form a codeword for symbol m. The code
construction is illustrated as follows:


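$$
\begin{array}{lllll}
 & C^*_{m-1}(\mathbf{p}') & & C_m(\mathbf{p}) & \\
p_1 & w'_1 & l'_1 & w_1 = w'_1 & l_1 = l'_1 \\
p_2 & w'_2 & l'_2 & w_2 = w'_2 & l_2 = l'_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
p_{m-2} & w'_{m-2} & l'_{m-2} & w_{m-2} = w'_{m-2} & l_{m-2} = l'_{m-2} \\
p_{m-1} + p_m & w'_{m-1} & l'_{m-1} & w_{m-1} = w'_{m-1}0 & l_{m-1} = l'_{m-1} + 1 \\
 & & & w_m = w'_{m-1}1 & l_m = l'_{m-1} + 1
\end{array}
\tag{5.65}
$$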

Calculation of the average length $\sum_i p_i l_i$ shows that

$$L(\mathbf{p}) = L^*(\mathbf{p}') + p_{m-1} + p_m. \tag{5.66}$$

Similarly, from the canonical code for $\mathbf{p}$, we construct a code for $\mathbf{p}'$ by
merging the codewords for the two lowest-probability symbols $m - 1$ and
m with probabilities $p_{m-1}$ and $p_m$, which are siblings by the properties
of the canonical code. The new code for $\mathbf{p}'$ has average length

$$L(\mathbf{p}') = \sum_{i=1}^{m-2} p_i l_i + p_{m-1}(l_{m-1} - 1) + p_m(l_m - 1) \tag{5.67}$$

$$= \sum_{i=1}^{m} p_i l_i - p_{m-1} - p_m \tag{5.68}$$

$$= L^*(\mathbf{p}) - p_{m-1} - p_m. \tag{5.69}$$

Adding (5.66) and (5.69) together, we obtain

$$L(\mathbf{p}') + L(\mathbf{p}) = L^*(\mathbf{p}') + L^*(\mathbf{p}) \tag{5.70}$$

or

$$(L(\mathbf{p}') - L^*(\mathbf{p}')) + (L(\mathbf{p}) - L^*(\mathbf{p})) = 0. \tag{5.71}$$

Now examine the two terms in (5.71). By assumption, since $L^*(\mathbf{p}')$ is the
optimal length for $\mathbf{p}'$, we have $L(\mathbf{p}') - L^*(\mathbf{p}') \ge 0$. Similarly, the length
of the extension of the optimal code for $\mathbf{p}'$ has to have an average length
at least as large as the optimal code for $\mathbf{p}$ [i.e., $L(\mathbf{p}) - L^*(\mathbf{p}) \ge 0$]. But
the sum of two nonnegative terms can only be 0 if both of them are 0,
which implies that $L(\mathbf{p}) = L^*(\mathbf{p})$ (i.e., the extension of the optimal code
for $\mathbf{p}'$ is optimal for $\mathbf{p}$).

Consequently, if we start with an optimal code for $\mathbf{p}'$ with $m - 1$
symbols and construct a code for m symbols by extending the codeword
corresponding to $p_{m-1} + p_m$, the new code is also optimal. Starting with
a code for two elements, in which case the optimal code is obvious, we
can by induction extend this result to prove the following theorem.

Theorem 5.8.1 Huffman coding is optimal; that is, if $C^*$ is a Huffman
code and $C'$ is any other uniquely decodable code, $L(C^*) \le L(C')$.

Although we have proved the theorem for a binary alphabet, the proof 
can be extended to establishing optimality of the Huffman coding algo¬ 
rithm for a D-ary alphabet as well. Incidentally, we should remark that 
Huffman coding is a “greedy” algorithm in that it coalesces the two least 
likely symbols at each stage. The proof above shows that this local opti¬ 
mality ensures global optimality of the final code. 
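
The induction step can be checked numerically on a small example. The Python sketch below is an illustration under our own assumptions: the function names are hypothetical, and it relies on the fact that the expected length of a binary Huffman code equals the sum of the probabilities of all merged nodes, since each merge adds one bit to every symbol beneath it. It then confirms that the optimal lengths for $\mathbf{p}$ and for its Huffman reduction $\mathbf{p}'$ differ by exactly $p_{m-1} + p_m$, as in (5.66) and (5.69).

```python
import heapq
from fractions import Fraction

def huffman_cost(p):
    """Expected length of a binary Huffman code, as the sum of merged node weights."""
    heap = list(p)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        cost += merged
        heapq.heappush(heap, merged)
    return cost

def huffman_reduction(p):
    """Replace the two smallest probabilities by their sum."""
    s = sorted(p)
    return s[2:] + [s[0] + s[1]]

p = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 5), Fraction(3, 20), Fraction(3, 20)]
reduced = huffman_reduction(p)
two_smallest = sum(sorted(p)[:2])
print(huffman_cost(p), huffman_cost(reduced) + two_smallest)   # both equal 23/10
```

Exact fractions are used so that the comparison is not affected by floating-point rounding.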

5.9 SHANNON-FANO-ELIAS CODING

In Section 5.4 we showed that the codeword lengths $l(x) = \lceil \log \frac{1}{p(x)} \rceil$
satisfy the Kraft inequality and can therefore be used to construct a uniquely
decodable code for the source. In this section we describe a simple
constructive procedure that uses the cumulative distribution function to allot
codewords.

Without loss of generality, we can take $\mathcal{X} = \{1, 2, \ldots, m\}$. Assume that
$p(x) > 0$ for all x. The cumulative distribution function F(x) is defined
as

$$F(x) = \sum_{a \le x} p(a). \tag{5.72}$$

This function is illustrated in Figure 5.5. Consider the modified cumulative
distribution function

$$\bar{F}(x) = \sum_{a < x} p(a) + \frac{1}{2} p(x), \tag{5.73}$$
FIGURE 5.5. Cumulative distribution function and Shannon-Fano-Elias coding.


where $\bar{F}(x)$ denotes the sum of the probabilities of all symbols less than
$x$ plus half the probability of the symbol $x$. Since the random variable is
discrete, the cumulative distribution function consists of steps of size $p(x)$.
The value of the function $\bar{F}(x)$ is the midpoint of the step corresponding
to $x$.

Since all the probabilities are positive, $\bar{F}(a) \ne \bar{F}(b)$ if $a \ne b$, and hence
we can determine $x$ if we know $\bar{F}(x)$. Merely look at the graph of the
cumulative distribution function and find the corresponding $x$. Thus, the
value of $\bar{F}(x)$ can be used as a code for $x$.

But, in general, $\bar{F}(x)$ is a real number expressible only by an infinite
number of bits. So it is not efficient to use the exact value of $\bar{F}(x)$ as a
code for $x$. If we use an approximate value, what is the required accuracy?

Assume that we truncate $\bar{F}(x)$ to $l(x)$ bits (denoted by $\lfloor \bar{F}(x) \rfloor_{l(x)}$).
Thus, we use the first $l(x)$ bits of $\bar{F}(x)$ as a code for $x$. By definition of
rounding off, we have

$$\bar{F}(x) - \lfloor \bar{F}(x) \rfloor_{l(x)} < \frac{1}{2^{l(x)}}. \tag{5.74}$$

If $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$, then

$$\frac{1}{2^{l(x)}} \le \frac{p(x)}{2} = \bar{F}(x) - F(x - 1), \tag{5.75}$$

and therefore $\lfloor \bar{F}(x) \rfloor_{l(x)}$ lies within the step corresponding to $x$. Thus,
$l(x)$ bits suffice to describe $x$.
In addition to requiring that the codeword identify the corresponding
symbol, we also require the set of codewords to be prefix-free. To check
whether the code is prefix-free, we consider each codeword $z_1 z_2 \cdots z_l$ to
represent not a point but the interval
$\left[ 0.z_1 z_2 \cdots z_l,\; 0.z_1 z_2 \cdots z_l + \frac{1}{2^l} \right)$. The
code is prefix-free if and only if the intervals corresponding to codewords
are disjoint.

We now verify that the code above is prefix-free. The interval
corresponding to any codeword has length $2^{-l(x)}$, which is at most half the
height of the step corresponding to $x$ by (5.75). The lower end of the
interval is in the lower half of the step. Thus, the upper end of the
interval does not extend past the top of the step, and the interval
corresponding to any codeword lies entirely within the step corresponding to
that symbol in the cumulative distribution function. Therefore, the
intervals corresponding to different codewords are disjoint and the code is
prefix-free. Note that this procedure does not require the symbols to be
ordered in terms of probability. Another procedure that uses the ordered
probabilities is described in Problem 5.5.28.

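Since we use $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$ bits to represent $x$, the expected
length of this code is

$$L = \sum_x p(x)\, l(x) = \sum_x p(x) \left( \left\lceil \log \frac{1}{p(x)} \right\rceil + 1 \right) < H(X) + 2. \tag{5.76}$$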


Thus, this coding scheme achieves an average codeword length that is 
within 2 bits of the entropy. 

Example 5.9.1 We first consider an example where all the probabilities
are dyadic. We construct the code in the following table, where
$l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$:

x    p(x)     F(x)     $\bar{F}(x)$    $\bar{F}(x)$ in binary    l(x)    Codeword
1    0.25     0.25     0.125     0.001               3       001
2    0.5      0.75     0.5       0.10                2       10
3    0.125    0.875    0.8125    0.1101              4       1101
4    0.125    1.0      0.9375    0.1111              4       1111



In this case, the average codeword length is 2.75 bits and the entropy
is 1.75 bits. The Huffman code for this case achieves the entropy
bound. Looking at the codewords, it is obvious that there is some
inefficiency: for example, the last bit of the last two codewords can be
omitted. But if we remove the last bit from all the codewords, the code
is no longer prefix-free.
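
The remark about dropping bits is easy to check mechanically. The following sketch (the helper name `is_prefix_free` is introduced here for illustration) tests the codewords of Example 5.9.1, the variant with the last bit removed from only the last two codewords, and the variant with the last bit removed from every codeword.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = sorted(codewords)
    # After sorting, any word that has a prefix in the set is preceded
    # (directly or through a chain) by a word it starts with, so checking
    # adjacent pairs is sufficient.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

code = ["001", "10", "1101", "1111"]           # codewords from Example 5.9.1
last_two_trimmed = ["001", "10", "110", "111"]  # drop last bit of last two only
all_trimmed = [w[:-1] for w in code]            # drop last bit of every codeword

print(is_prefix_free(code))              # True
print(is_prefix_free(last_two_trimmed))  # True
print(is_prefix_free(all_trimmed))       # False: "1" is a prefix of "110"
```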
Example 5.9.2 We now give another example for construction of the
Shannon-Fano-Elias code. In this case, since the distribution is not
dyadic, the representation of $\bar{F}(x)$ in binary may have an infinite number
of bits. We denote 0.01010101... by $0.\overline{01}$. We construct the code in the
following table, where $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$:

x    p(x)    F(x)    $\bar{F}(x)$    $\bar{F}(x)$ in binary      l(x)    Codeword
1    0.25    0.25    0.125    0.001                 3       001
2    0.25    0.5     0.375    0.011                 3       011
3    0.2     0.7     0.6      $0.1\overline{0011}$        4       1001
4    0.15    0.85    0.775    $0.110\overline{0011}$      4       1100
5    0.15    1.0     0.925    $0.111\overline{0110}$      4       1110



The above code is 1.2 bits longer on the average than the Huffman 
code for this source (Example 5.6.1). 
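
The construction used in the two examples can be sketched in a few lines. This is an illustrative implementation, not code from the text: the function name `sfe_code` is made up here, symbols are taken in the order given, and exact rational arithmetic (`fractions.Fraction`) is used so that the binary expansion of the midpoint is truncated without floating-point error.

```python
from fractions import Fraction
from math import ceil, log2

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for symbols 1..m with the given probabilities."""
    codewords = []
    cum = Fraction(0)                       # F(x - 1): mass of all earlier symbols
    for p in probs:
        p = Fraction(p)
        fbar = cum + p / 2                  # midpoint of the step for this symbol
        length = ceil(log2(1 / p)) + 1      # l(x) = ceil(log 1/p(x)) + 1
        bits = ""
        frac = fbar
        for _ in range(length):             # first l(x) bits of the binary expansion
            frac *= 2
            bit = int(frac >= 1)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        cum += p
    return codewords

dyadic = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 8), Fraction(1, 8)]
nondyadic = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 5),
             Fraction(3, 20), Fraction(3, 20)]
print(sfe_code(dyadic))      # ['001', '10', '1101', '1111']
print(sfe_code(nondyadic))   # ['001', '011', '1001', '1100', '1110']
```

Passing probabilities as `Fraction` values (rather than floats such as `0.2`) keeps the cumulative sums exact, which matters when a midpoint falls close to a bit boundary.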

The Shannon-Fano-Elias coding procedure can also be applied to
sequences of random variables. The key idea is to use the cumulative
distribution function of the sequence, expressed to the appropriate
accuracy, as a code for the sequence. Direct application of the method to blocks
of length n would require calculation of the probabilities and cumulative
distribution function for all sequences of length n, a calculation that would
grow exponentially with the block length. But a simple trick ensures that
we can calculate both the probability and the cumulative distribution
function sequentially as we see each symbol in the block, ensuring that the
calculation grows only linearly with the block length. Direct application
of Shannon-Fano-Elias coding would also need arithmetic whose
precision grows with the block size, which is not practical when we deal with
long blocks. In Chapter 13 we describe arithmetic coding, which is an
extension of the Shannon-Fano-Elias method to sequences of random
variables that encodes using fixed-precision arithmetic with a complexity
that is linear in the length of the sequence. This method is the basis of
many practical compression schemes such as those used in the JPEG and
FAX compression standards.
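
The text defers the details of the sequential computation to Chapter 13. As a rough sketch under stated assumptions, the code below shows one standard way to do it for an i.i.d. source with sequences ordered lexicographically: keep a running pair (probability mass of all earlier sequences, probability of the prefix seen so far) and update it once per symbol. The i.i.d. assumption, the 0-based symbol indices, and the function names are choices made here for illustration; exact fractions are used, so this sketch does not address the fixed-precision issue mentioned above.

```python
from fractions import Fraction
from itertools import product

def sequential_interval(seq, probs):
    """Return (low, width): the total probability of all sequences that precede
    `seq` lexicographically, and the probability of `seq`, for an i.i.d. source."""
    low, width = Fraction(0), Fraction(1)
    for s in seq:
        below = sum(probs[:s], Fraction(0))   # mass of symbols smaller than s
        low += width * below                  # one update per symbol: linear time
        width *= probs[s]
    return low, width

def brute_force_interval(seq, probs):
    """Same quantities by enumerating all m**n sequences (exponential time)."""
    n, m = len(seq), len(probs)

    def prob(t):
        out = Fraction(1)
        for s in t:
            out *= probs[s]
        return out

    low = sum((prob(t) for t in product(range(m), repeat=n) if t < tuple(seq)),
              Fraction(0))
    return low, prob(seq)

probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
seq = (1, 0, 2, 1)
print(sequential_interval(seq, probs) == brute_force_interval(seq, probs))
```

The midpoint used by the Shannon-Fano-Elias construction is then `low + width / 2`.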

5.10 COMPETITIVE OPTIMALITY OF THE SHANNON CODE 

We have shown that Huffman coding is optimal in that it has minimum
expected length. But what does that say about its performance on any
particular sequence? For example, is it always better than any other code
for all sequences? Obviously not, since there are codes that assign short
codewords to infrequent source symbols. Such codes will be better than
the Huffman code on those source symbols.

To formalize the question of competitive optimality, consider the
following two-person zero-sum game: Two people are given a probability
distribution and are asked to design an instantaneous code for the
distribution. Then a source symbol is drawn from this distribution, and the
payoff to player A is 1 or -1, depending on whether the codeword of
player A is shorter or longer than the codeword of player B. The payoff
is 0 for ties.

Dealing with Huffman code lengths is difficult, since there is no explicit
expression for the codeword lengths. Instead, we consider the Shannon
code with codeword lengths $l(x) = \lceil \log \frac{1}{p(x)} \rceil$. In this case, we have the
following theorem.


Theorem 5.10.1 Let $l(x)$ be the codeword lengths associated with the
Shannon code, and let $l'(x)$ be the codeword lengths associated with any
other uniquely decodable code. Then

$$\Pr\bigl( l(X) \ge l'(X) + c \bigr) \le \frac{1}{2^{c-1}}. \tag{5.77}$$

For example, the probability that $l'(X)$ is 5 or more bits shorter than
$l(X)$ is at most $\frac{1}{16}$.

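Proof

$$
\begin{aligned}
\Pr\bigl( l(X) \ge l'(X) + c \bigr)
 &= \Pr\left( \left\lceil \log \frac{1}{p(X)} \right\rceil \ge l'(X) + c \right) && (5.78) \\
 &\le \Pr\left( \log \frac{1}{p(X)} \ge l'(X) + c - 1 \right) && (5.79) \\
 &= \Pr\left( p(X) \le 2^{-l'(X) - c + 1} \right) && (5.80) \\
 &= \sum_{x:\, p(x) \le 2^{-l'(x) - c + 1}} p(x) && (5.81) \\
 &\le \sum_{x:\, p(x) \le 2^{-l'(x) - c + 1}} 2^{-l'(x) - (c - 1)} && (5.82) \\
 &\le \sum_x 2^{-l'(x)}\, 2^{-(c-1)} && (5.83) \\
 &\le 2^{-(c-1)} && (5.84)
\end{aligned}
$$

since $\sum_x 2^{-l'(x)} \le 1$ by the Kraft inequality. □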
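
A small numerical check can make the bound tangible. The sketch below compares the Shannon code with a competing code that deliberately favors the rarest symbol; the distribution, the competing lengths, and the helper names are all chosen here for illustration and are not from the text.

```python
from math import ceil, log2

def shannon_lengths(probs):
    """Shannon code lengths l(x) = ceil(log2(1 / p(x)))."""
    return [ceil(log2(1 / p)) for p in probs]

def prob_longer_by(probs, l, l_other, c):
    """Pr(l(X) >= l'(X) + c) under the source distribution."""
    return sum(p for p, a, b in zip(probs, l, l_other) if a >= b + c)

probs = [0.4, 0.3, 0.2, 0.1]
l = shannon_lengths(probs)
l_other = [3, 3, 2, 1]          # a prefix code that gives the rarest symbol 1 bit

for c in range(1, 6):
    lhs = prob_longer_by(probs, l, l_other, c)
    print(c, lhs, 2.0 ** -(c - 1), lhs <= 2.0 ** -(c - 1))
```

The competing code wins by a wide margin only on low-probability symbols, which is exactly what the bound quantifies.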
Hence, no other code can do much better than the Shannon code most
of the time. We now strengthen this result. In a game-theoretic setting,
one would like to ensure that $l(x) < l'(x)$ more often than $l(x) > l'(x)$.
The fact that $l(x) \le l'(x) + 1$ with probability $\ge \frac{1}{2}$ does not ensure this.
We now show that even under this stricter criterion, Shannon coding is
optimal. Recall that the probability mass function $p(x)$ is dyadic if $\log \frac{1}{p(x)}$
is an integer for all $x$.

Theorem 5.10.2 For a dyadic probability mass function $p(x)$, let
$l(x) = \log \frac{1}{p(x)}$ be the word lengths of the binary Shannon code for the
source, and let $l'(x)$ be the lengths of any other uniquely decodable binary
code for the source. Then

$$\Pr\bigl( l(X) < l'(X) \bigr) \ge \Pr\bigl( l(X) > l'(X) \bigr), \tag{5.85}$$

with equality if and only if $l'(x) = l(x)$ for all $x$. Thus, the code length
assignment $l(x) = \log \frac{1}{p(x)}$ is uniquely competitively optimal.

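Proof: Define the function $\operatorname{sgn}(t)$ as follows:

$$
\operatorname{sgn}(t) =
\begin{cases}
1 & \text{if } t > 0 \\
0 & \text{if } t = 0 \\
-1 & \text{if } t < 0
\end{cases}
\tag{5.86}
$$

Then it is easy to see from Figure 5.6 that

$$\operatorname{sgn}(t) \le 2^t - 1 \quad \text{for } t = 0, \pm 1, \pm 2, \ldots. \tag{5.87}$$

FIGURE 5.6. Sgn function and a bound.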
Note that though this inequality is not satisfied for all $t$, it is satisfied
at all integer values of $t$. We can now write

$$
\begin{aligned}
\Pr\bigl( l'(X) < l(X) \bigr) - \Pr\bigl( l'(X) > l(X) \bigr)
 &= \sum_{x:\, l'(x) < l(x)} p(x) - \sum_{x:\, l'(x) > l(x)} p(x) && (5.88) \\
 &= \sum_x p(x)\, \operatorname{sgn}\bigl( l(x) - l'(x) \bigr) && (5.89) \\
 &= E\, \operatorname{sgn}\bigl( l(X) - l'(X) \bigr) && (5.90) \\
 &\overset{(a)}{\le} \sum_x p(x) \left( 2^{l(x) - l'(x)} - 1 \right) && (5.91) \\
 &= \sum_x 2^{-l(x)} \left( 2^{l(x) - l'(x)} - 1 \right) && (5.92) \\
 &= \sum_x 2^{-l'(x)} - \sum_x 2^{-l(x)} && (5.93) \\
 &= \sum_x 2^{-l'(x)} - 1 && (5.94) \\
 &\overset{(b)}{\le} 1 - 1 && (5.95) \\
 &= 0, && (5.96)
\end{aligned}
$$

where (a) follows from the bound on $\operatorname{sgn}(t)$ and (b) follows from the fact
that $l'(x)$ satisfies the Kraft inequality.

We have equality in the above chain only if we have equality in (a)
and (b). We have equality in the bound for $\operatorname{sgn}(t)$ only if $t$ is 0 or 1 [i.e.,
$l(x) = l'(x)$ or $l(x) = l'(x) + 1$]. Equality in (b) implies that $l'(x)$ satisfies
the Kraft inequality with equality. Combining these two facts implies that
$l'(x) = l(x)$ for all $x$. □
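
For a small dyadic source the theorem can be verified by exhaustive search. The sketch below is an illustration only: the distribution (1/2, 1/4, 1/8, 1/8), the cap of 6 bits on competing lengths, and the helper names are assumptions made here. It enumerates every length vector that satisfies the Kraft inequality and computes the difference of the two probabilities in (5.85).

```python
from fractions import Fraction
from itertools import product

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
shannon = [1, 2, 3, 3]                       # l(x) = log 1/p(x) for this source

def advantage(l_other):
    """Pr(l(X) < l'(X)) - Pr(l(X) > l'(X)), with l the Shannon lengths."""
    win = sum(p for p, a, b in zip(probs, shannon, l_other) if a < b)
    lose = sum(p for p, a, b in zip(probs, shannon, l_other) if a > b)
    return win - lose

def kraft_ok(lengths):
    """True if the lengths satisfy the binary Kraft inequality."""
    return sum(Fraction(1, 2 ** k) for k in lengths) <= 1

# All competing length assignments with lengths between 1 and 6 bits.
candidates = [c for c in product(range(1, 7), repeat=4) if kraft_ok(c)]
worst = min(advantage(c) for c in candidates)
ties = [c for c in candidates if advantage(c) == 0]
print(worst, ties)
```

If the theorem holds, the minimum advantage is 0 and it is attained only by the Shannon lengths themselves.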

Corollary For nondyadic probability mass functions,

$$E\, \operatorname{sgn}\bigl( l(X) - l'(X) - 1 \bigr) \le 0, \tag{5.97}$$

where $l(x) = \lceil \log \frac{1}{p(x)} \rceil$ and $l'(x)$ is any other code for the source.

Proof: Along the same lines as the preceding proof. □

Hence we have shown that Shannon coding $l(x) = \lceil \log \frac{1}{p(x)} \rceil$ is
optimal under a variety of criteria; it is robust with respect to the payoff
function. In particular, for dyadic $p$, $E(l - l') \le 0$, $E\, \operatorname{sgn}(l - l') \le 0$, and
by use of inequality (5.87), $E f(l - l') \le 0$ for any function $f$ satisfying
$f(t) \le 2^t - 1$, $t = 0, \pm 1, \pm 2, \ldots$.


5.11 GENERATION OF DISCRETE DISTRIBUTIONS FROM FAIR COINS


In the early sections of this chapter we considered the problem of
representing a random variable by a sequence of bits such that the expected
length of the representation was minimized. It can be argued
(Problem 5.5.29) that the encoded sequence is essentially incompressible and
therefore has an entropy rate close to 1 bit per symbol. Therefore, the bits
of the encoded sequence are essentially fair coin flips.

In this section we take a slight detour from our discussion of source 
coding and consider the dual question. How many fair coin flips does 
it take to generate a random variable X drawn according to a specified 
probability mass function p? We first consider a simple example. 


Example 5.11.1 Given a sequence of fair coin tosses (fair bits), suppose
that we wish to generate a random variable X with distribution

$$
X =
\begin{cases}
a & \text{with probability } \frac{1}{2} \\
b & \text{with probability } \frac{1}{4} \\
c & \text{with probability } \frac{1}{4}
\end{cases}
\tag{5.98}
$$

It is easy to guess the answer. If the first bit is 0, we let X = a. If the
first two bits are 10, we let X = b. If we see 11, we let X = c. It is clear
that X has the desired distribution.

We calculate the average number of fair bits required for generating
the random variable, in this case as $\frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{4}(2) = 1.5$ bits. This
is also the entropy of the distribution. Is this unusual? No, as the results
of this section indicate.


The general problem can now be formulated as follows. We are given a
sequence of fair coin tosses $Z_1, Z_2, \ldots$, and we wish to generate a discrete
random variable $X \in \mathcal{X} = \{1, 2, \ldots, m\}$ with probability mass function
$\mathbf{p} = (p_1, p_2, \ldots, p_m)$. Let the random variable T denote the number of
coin flips used in the algorithm.

FIGURE 5.7. Tree for generation of the distribution $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$.

We can describe the algorithm mapping strings of bits $Z_1, Z_2, \ldots$, to
possible outcomes X by a binary tree. The leaves of the tree are marked
by output symbols X, and the path to the leaves is given by the sequence
of bits produced by the fair coin. For example, the tree for the distribution
$(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ is shown in Figure 5.7.

The tree representing the algorithm must satisfy certain properties: 

1. The tree should be complete (i.e., every node is either a leaf or has 
two descendants in the tree). The tree may be infinite, as we will 
see in some examples. 

2. The probability of a leaf at depth k is $2^{-k}$. Many leaves may be
labeled with the same output symbol; the total probability of all
these leaves should equal the desired probability of the output
symbol.

3. The expected number of fair bits ET required to generate X is equal 
to the expected depth of this tree. 

There are many possible algorithms that generate the same output
distribution. For example, the mapping 00 → a, 01 → b, 10 → c, 11 → a
also yields the distribution $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$. However, this algorithm uses two
fair bits to generate each sample and is therefore not as efficient as the
mapping given earlier, which used only 1.5 bits per sample. This brings
up the question: What is the most efficient algorithm to generate a given
distribution, and how is this related to the entropy of the distribution?
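
The two mappings discussed so far can be compared directly. In the sketch below a generating tree is represented as a dictionary from leaf bit strings to output symbols; this representation and the function names are choices made here for illustration.

```python
from fractions import Fraction

def expected_flips(tree):
    """Expected number of fair bits: each leaf at depth k is reached w.p. 2**-k."""
    return sum(Fraction(len(path), 2 ** len(path)) for path in tree)

def output_distribution(tree):
    """Total leaf probability assigned to each output symbol."""
    dist = {}
    for path, symbol in tree.items():
        dist[symbol] = dist.get(symbol, 0) + Fraction(1, 2 ** len(path))
    return dist

tree_short = {"0": "a", "10": "b", "11": "c"}             # mapping of Example 5.11.1
tree_fixed = {"00": "a", "01": "b", "10": "c", "11": "a"}  # two bits per sample

print(output_distribution(tree_short) == output_distribution(tree_fixed))  # True
print(expected_flips(tree_short), expected_flips(tree_fixed))              # 3/2 2
```

Both trees are complete, so the leaf probabilities sum to 1 in each case; only the expected depth differs.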

We expect that we need at least as much randomness in the fair bits as
we produce in the output samples. Since entropy is a measure of
randomness, and each fair bit has an entropy of 1 bit, we expect that the number
of fair bits used will be at least equal to the entropy of the output. This
is proved in the following theorem. We will need a simple lemma about
trees in the proof of the theorem. Let $\mathcal{Y}$ denote the set of leaves of a
complete tree. Consider a distribution on the leaves such that the probability
of a leaf at depth k on the tree is $2^{-k}$. Let Y be a random variable with
this distribution. Then we have the following lemma.

Lemma 5.11.1 For any complete tree, consider a probability
distribution on the leaves such that the probability of a leaf at depth k is $2^{-k}$. Then
the expected depth of the tree is equal to the entropy of this distribution.

Proof: The expected depth of the tree is

$$ET = \sum_{y \in \mathcal{Y}} k(y)\, 2^{-k(y)} \tag{5.99}$$

and the entropy of the distribution of Y is

$$
\begin{aligned}
H(Y) &= -\sum_{y \in \mathcal{Y}} 2^{-k(y)} \log 2^{-k(y)} && (5.100) \\
     &= \sum_{y \in \mathcal{Y}} k(y)\, 2^{-k(y)}, && (5.101)
\end{aligned}
$$

where $k(y)$ denotes the depth of leaf $y$. Thus,

$$H(Y) = ET. \quad \Box \tag{5.102}$$
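
As a quick numerical illustration of the lemma, the sketch below takes the leaf depths of one particular complete binary tree (depths 1, 2, 3, 4, 4, an example chosen here) and evaluates both sides. The function names are illustrative.

```python
from math import log2

def expected_depth(depths):
    """Sum over leaves of k * 2**-k."""
    return sum(k * 2.0 ** -k for k in depths)

def leaf_entropy(depths):
    """Entropy of the leaf distribution in which a depth-k leaf has mass 2**-k."""
    return -sum(2.0 ** -k * log2(2.0 ** -k) for k in depths)

depths = [1, 2, 3, 4, 4]     # a complete tree: the leaf probabilities sum to 1
print(sum(2.0 ** -k for k in depths))             # 1.0
print(expected_depth(depths), leaf_entropy(depths))
```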


Theorem 5.11.1 For any algorithm generating X, the expected number
of fair bits used is at least the entropy H(X); that is,

$$ET \ge H(X). \tag{5.103}$$

Proof: Any algorithm generating X from fair bits can be represented by
a complete binary tree. Label all the leaves of this tree by distinct symbols
$y \in \mathcal{Y} = \{1, 2, \ldots\}$. If the tree is infinite, the alphabet $\mathcal{Y}$ is also infinite.

Now consider the random variable Y defined on the leaves of the tree,
such that for any leaf $y$ at depth k, the probability that Y = y is $2^{-k}$.
By Lemma 5.11.1, the expected depth of this tree is equal to the entropy
of Y:

$$ET = H(Y). \tag{5.104}$$

Now the random variable X is a function of Y (one or more leaves
map onto an output symbol), and hence by the result of Problem 2.4, we
have

$$H(X) \le H(Y). \tag{5.105}$$

Thus, for any algorithm generating the random variable X, we have

$$H(X) \le ET. \quad \Box \tag{5.106}$$


The same argument answers the question of optimality for a dyadic
distribution.

Theorem 5.11.2 Let the random variable X have a dyadic
distribution. The optimal algorithm to generate X from fair coin flips requires an
expected number of coin tosses precisely equal to the entropy:

$$ET = H(X). \tag{5.107}$$

Proof: Theorem 5.11.1 shows that we need at least H(X) bits to generate
X. For the constructive part, we use the Huffman code tree for X as
the tree to generate the random variable. For a dyadic distribution, the
Huffman code is the same as the Shannon code and achieves the entropy
bound. For any $x \in \mathcal{X}$, the depth of the leaf in the code tree corresponding
to $x$ is the length of the corresponding codeword, which is $\log \frac{1}{p(x)}$. Hence,
when this code tree is used to generate X, the leaf will have a probability
$2^{-\log \frac{1}{p(x)}} = p(x)$. The expected number of coin flips is the expected depth
of the tree, which is equal to the entropy (because the distribution is
dyadic). Hence, for a dyadic distribution, the optimal generating algorithm
achieves

$$ET = H(X). \quad \Box \tag{5.108}$$

What if the distribution is not dyadic? In this case we cannot use the
same idea, since the code tree for the Huffman code will generate a dyadic
distribution on the leaves, not the distribution with which we started. Since
all the leaves of the tree have probabilities of the form $2^{-k}$, it follows that
we should split any probability $p_i$ that is not of this form into atoms of this
form. We can then allot these atoms to leaves on the tree. For example, if
one of the outcomes $x$ has probability $p(x) = \frac{1}{4}$, we need only one atom
(leaf of the tree at level 2), but if $p(x) = \frac{7}{8} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8}$, we need three
atoms, one each at levels 1, 2, and 3 of the tree.

To minimize the expected depth of the tree, we should use atoms with
as large a probability as possible. So given a probability $p_i$, we find the
largest atom of the form $2^{-k}$ that is less than or equal to $p_i$ and allot this
atom to the tree. Then we calculate the remainder and find the largest atom
that will fit in the remainder. Continuing this process, we can split all the
probabilities into dyadic atoms. This process is equivalent to finding the
binary expansions of the probabilities. Let the binary expansion of the
probability $p_i$ be

$$p_i = \sum_{j \ge 1} p_i^{(j)}, \tag{5.109}$$

where $p_i^{(j)} = 2^{-j}$ or 0. Then the atoms of the expansion are the
$\{ p_i^{(j)} : i = 1, 2, \ldots, m,\ j \ge 1 \}$.

Since $\sum_i p_i = 1$, the sum of the probabilities of these atoms is 1.
We will allot an atom of probability $2^{-j}$ to a leaf at depth j on the
tree. The depths of the atoms satisfy the Kraft inequality, and hence by
Theorem 5.2.1, we can always construct such a tree with all the atoms at
the right depths. We illustrate this procedure with an example.

Example 5.11.2 Let X have the distribution

$$
X =
\begin{cases}
a & \text{with probability } \frac{2}{3} \\
b & \text{with probability } \frac{1}{3}
\end{cases}
\tag{5.110}
$$

We find the binary expansions of these probabilities:

$$\frac{2}{3} = 0.10101010\ldots_2 \tag{5.111}$$

$$\frac{1}{3} = 0.01010101\ldots_2 \tag{5.112}$$

Hence, the atoms for the expansion are

$$\frac{2}{3} \to \left( \frac{1}{2}, \frac{1}{8}, \frac{1}{32}, \ldots \right) \tag{5.113}$$

$$\frac{1}{3} \to \left( \frac{1}{4}, \frac{1}{16}, \frac{1}{64}, \ldots \right) \tag{5.114}$$

These can be allotted to a tree as shown in Figure 5.8.

FIGURE 5.8. Tree to generate a $(\frac{2}{3}, \frac{1}{3})$ distribution.

This procedure yields a tree that generates the random variable X.
We have argued that this procedure is optimal (gives a tree of minimum
expected depth), but we will not give a formal proof. Instead, we bound
the expected depth of the tree generated by this procedure.
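
The splitting of a probability into dyadic atoms is just a greedy reading of its binary expansion, which the following sketch carries out with exact fractions. The function name `dyadic_atoms` and the truncation parameter `max_depth` are introduced here for illustration; the input is assumed to satisfy 0 < p < 1, and an expansion that does not terminate is simply cut off at `max_depth` bits.

```python
from fractions import Fraction

def dyadic_atoms(p, max_depth):
    """Depths j (1 <= j <= max_depth) at which the binary expansion of p has a 1,
    i.e. the atoms 2**-j allotted to an outcome of probability p."""
    rest = Fraction(p)
    depths = []
    for j in range(1, max_depth + 1):
        atom = Fraction(1, 2 ** j)
        if rest >= atom:       # greedy: take the largest atom that still fits
            depths.append(j)
            rest -= atom
    return depths

print(dyadic_atoms(Fraction(2, 3), 6))   # [1, 3, 5]  -> atoms 1/2, 1/8, 1/32
print(dyadic_atoms(Fraction(1, 3), 6))   # [2, 4, 6]  -> atoms 1/4, 1/16, 1/64
print(dyadic_atoms(Fraction(7, 8), 6))   # [1, 2, 3]
```

Each returned depth j corresponds to one leaf at depth j labeled with the outcome in question.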


Theorem 5.11.3 The expected number of fair bits required by the
optimal algorithm to generate a random variable X lies between H(X) and
H(X) + 2:

$$H(X) \le ET < H(X) + 2. \tag{5.115}$$


Proof: The lower bound on the expected number of coin tosses is proved
in Theorem 5.11.1. For the upper bound, we write down an explicit
expression for the expected number of coin tosses required for the
procedure described above. We split all the probabilities $(p_1, p_2, \ldots, p_m)$ into
dyadic atoms, for example,

$$p_1 \to \left( p_1^{(1)}, p_1^{(2)}, \ldots \right), \tag{5.116}$$

and so on. Using these atoms (which form a dyadic distribution), we
construct a tree with leaves corresponding to each of these atoms. The
number of coin tosses required to generate each atom is its depth in the
tree, and therefore the expected number of coin tosses is the expected
depth of the tree, which is equal to the entropy of the dyadic distribution
of the atoms. Hence,

$$ET = H(Y), \tag{5.117}$$

where Y has the distribution
$\left( p_1^{(1)}, p_1^{(2)}, \ldots, p_2^{(1)}, p_2^{(2)}, \ldots, p_m^{(1)}, p_m^{(2)}, \ldots \right)$.
Now since X is a function of Y, we have

$$H(Y) = H(Y, X) = H(X) + H(Y|X), \tag{5.118}$$
and our objective is to show that $H(Y|X) < 2$. We now give an algebraic
proof of this result. Expanding the entropy of Y, we have

$$
\begin{aligned}
H(Y) &= -\sum_{i=1}^{m} \sum_{j \ge 1} p_i^{(j)} \log p_i^{(j)} && (5.119) \\
     &= \sum_{i=1}^{m} \sum_{j:\, p_i^{(j)} > 0} j\, 2^{-j}, && (5.120)
\end{aligned}
$$

since each of the atoms is either 0 or $2^{-k}$ for some k. Now consider the
term in the expansion corresponding to each $i$, which we shall call $T_i$:

$$T_i = \sum_{j:\, p_i^{(j)} > 0} j\, 2^{-j}. \tag{5.121}$$

We can find an $n$ such that $2^{-(n-1)} > p_i \ge 2^{-n}$, or

$$n - 1 < -\log p_i \le n. \tag{5.122}$$

Then it follows that $p_i^{(j)} > 0$ only if $j \ge n$, so that we can rewrite (5.121)
as

$$T_i = \sum_{j:\, j \ge n,\ p_i^{(j)} > 0} j\, 2^{-j}. \tag{5.123}$$

We use the definition of the atom to write $p_i$ as

$$p_i = \sum_{j:\, j \ge n,\ p_i^{(j)} > 0} 2^{-j}. \tag{5.124}$$

To prove the upper bound, we first show that $T_i < -p_i \log p_i + 2 p_i$.
Consider the difference

$$
\begin{aligned}
T_i + p_i \log p_i - 2 p_i
 &\overset{(a)}{<} T_i - p_i (n - 1) - 2 p_i && (5.125) \\
 &= T_i - (n - 1 + 2)\, p_i && (5.126) \\
 &= \sum_{j:\, j \ge n,\ p_i^{(j)} > 0} j\, 2^{-j} - (n + 1) \sum_{j:\, j \ge n,\ p_i^{(j)} > 0} 2^{-j} && (5.127) \\
 &= \sum_{j:\, j \ge n,\ p_i^{(j)} > 0} (j - n - 1)\, 2^{-j} && (5.128) \\
 &= -2^{-n} + 0 + \sum_{j:\, j \ge n+2,\ p_i^{(j)} > 0} (j - n - 1)\, 2^{-j} && (5.129) \\
 &\overset{(b)}{=} -2^{-n} + \sum_{k:\, k \ge 1,\ p_i^{(k+n+1)} > 0} k\, 2^{-(k+n+1)} && (5.130) \\
 &\overset{(c)}{\le} -2^{-n} + \sum_{k:\, k \ge 1} k\, 2^{-(k+n+1)} && (5.131) \\
 &= -2^{-n} + 2^{-(n+1)} \cdot 2 && (5.132) \\
 &= 0, && (5.133)
\end{aligned}
$$

where (a) follows from (5.122), (b) follows from a change of variables
for the summation, and (c) follows from increasing the range of the
summation. Hence, we have shown that

$$T_i < -p_i \log p_i + 2 p_i. \tag{5.134}$$

Since $ET = \sum_i T_i$, it follows immediately that

$$ET < -\sum_i p_i \log p_i + 2 \sum_i p_i = H(X) + 2, \tag{5.135}$$

completing the proof of the theorem. □


Thus, an average of H(X) + 2 coin flips suffice to simulate a random 
variable X. 
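
The bounds of Theorem 5.11.3 can be checked numerically for small examples. The sketch below approximates ET by summing j·2^(-j) over the dyadic atoms of each probability, with the binary expansions truncated at 60 bits; the truncation depth, the function names, and the test distributions are assumptions made here for illustration.

```python
from fractions import Fraction
from math import log2

def expected_tosses(probs, max_depth=60):
    """Approximate ET = sum_i T_i, where T_i sums j * 2**-j over the atoms of p_i.
    Binary expansions are truncated at max_depth bits."""
    total = Fraction(0)
    for p in probs:
        rest = Fraction(p)
        for j in range(1, max_depth + 1):
            atom = Fraction(1, 2 ** j)
            if rest >= atom:          # p has an atom at depth j
                total += j * atom
                rest -= atom
    return float(total)

def entropy(probs):
    """Entropy in bits."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

for probs in ([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
              [Fraction(2, 3), Fraction(1, 3)]):
    h, et = entropy(probs), expected_tosses(probs)
    print(h, et, h <= et < h + 2)
```

For the dyadic distribution the two quantities coincide, while for (2/3, 1/3) the procedure uses about 2 tosses on average against an entropy of roughly 0.918 bits.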


SUMMARY

Kraft inequality. Instantaneous codes $\Leftrightarrow \sum_i D^{-l_i} \le 1$.

McMillan inequality. Uniquely decodable codes $\Leftrightarrow \sum_i D^{-l_i} \le 1$.

Entropy bound on data compression

$$L \triangleq \sum_i p_i l_i \ge H_D(X). \tag{5.136}$$

Shannon code

$$l_i = \left\lceil \log_D \frac{1}{p_i} \right\rceil \tag{5.137}$$

$$H_D(X) \le L < H_D(X) + 1. \tag{5.138}$$

Huffman code

$$L^* = \min_{\sum_i D^{-l_i} \le 1} \sum_i p_i l_i \tag{5.139}$$

$$H_D(X) \le L^* < H_D(X) + 1. \tag{5.140}$$

Wrong code. $X \sim p(x)$, $l(x) = \lceil \log \frac{1}{q(x)} \rceil$, $L = \sum_x p(x)\, l(x)$:

$$H(p) + D(p \| q) \le L < H(p) + D(p \| q) + 1. \tag{5.141}$$

Stochastic processes

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}. \tag{5.142}$$

Stationary processes

$$L_n \to H(\mathcal{X}). \tag{5.143}$$

Competitive optimality. Shannon code $l(x) = \lceil \log \frac{1}{p(x)} \rceil$ versus any
other code $l'(x)$:

$$\Pr\bigl( l(X) \ge l'(X) + c \bigr) \le \frac{1}{2^{c-1}}. \tag{5.144}$$


PROBLEMS 

5.1 Uniquely decodable and instantaneous codes. Let
$L = \sum_{i=1}^{m} p_i l_i^{100}$ be the expected value of the 100th power
of the word lengths associated with an encoding of the random
variable X. Let $L_1 = \min L$ over all instantaneous codes; and let
$L_2 = \min L$ over all uniquely decodable codes. What inequality
relationship exists between $L_1$ and $L_2$?










5.2 How many fingers has a Martian? Let

$$S = \begin{pmatrix} S_1, & \ldots, & S_m \\ p_1, & \ldots, & p_m \end{pmatrix}.$$

The $S_i$'s are encoded into strings from a D-symbol output alphabet
in a uniquely decodable manner. If $m = 6$ and the codeword lengths
are $(l_1, l_2, \ldots, l_6) = (1, 1, 2, 3, 2, 3)$, find a good lower bound on
D. You may wish to explain the title of the problem.

5.3 Slackness in the Kraft inequality. An instantaneous code has word
lengths $l_1, l_2, \ldots, l_m$, which satisfy the strict inequality

$$\sum_{i=1}^{m} D^{-l_i} < 1.$$

The code alphabet is $\mathcal{D} = \{0, 1, 2, \ldots, D - 1\}$. Show that there
exist arbitrarily long sequences of code symbols in $\mathcal{D}^*$ which cannot
be decoded into sequences of codewords.

5.4 Huffman coding. Consider the random variable

$$X = \begin{pmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 \\ 0.49 & 0.26 & 0.12 & 0.04 & 0.04 & 0.03 & 0.02 \end{pmatrix}.$$

(a) Find a binary Huffman code for X. 

(b) Find the expected code length for this encoding. 

(c) Find a ternary Huffman code for X. 

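5.5 More Huffman codes. Find the binary Huffman code for the
source with probabilities $(\frac{1}{3}, \frac{1}{5}, \frac{1}{5}, \frac{2}{15}, \frac{2}{15})$. Argue that this code is
also optimal for the source with probabilities $(\frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5}, \frac{1}{5})$.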

5.6 Bad codes. Which of these codes cannot be Huffman codes for
any probability assignment?

(a) {0, 10, 11}

(b) {00, 01, 10, 110}

(c) {01, 10}

5.7 Huffman 20 questions. Consider a set of n objects. Let $X_i = 1$
or 0 accordingly as the ith object is good or defective. Let
$X_1, X_2, \ldots, X_n$ be independent with $\Pr\{X_i = 1\} = p_i$; and
$p_1 > p_2 > \cdots > p_n > \frac{1}{2}$. We are asked to determine the set of all
defective objects. Any yes-no question you can think of is admissible.






(a) Give a good lower bound on the minimum average number of 
questions required. 

(b) If the longest sequence of questions is required by nature's
answers to our questions, what (in words) is the last question
we should ask? What two sets are we distinguishing with
this question? Assume a compact (minimum average length)
sequence of questions.

(c) Give an upper bound (within one question) on the minimum 
average number of questions required. 

5.8 Simple optimum compression of a Markov source. Consider the
three-state Markov process $U_1, U_2, \ldots$ having transition matrix

$U_{n-1} \backslash U_n$    $S_1$    $S_2$    $S_3$
$S_1$              1/2    1/4    1/4
$S_2$              1/4    1/2    1/4
$S_3$              0      1/2    1/2

Thus, the probability that $S_1$ follows $S_3$ is equal to zero. Design
three codes $C_1, C_2, C_3$ (one for each state 1, 2 and 3), each code
mapping elements of the set of $S_i$'s into sequences of 0's and 1's,
such that this Markov process can be sent with maximal compression
by the following scheme:

(a) Note the present symbol $X_n = i$.

(b) Select code $C_i$.

(c) Note the next symbol $X_{n+1} = j$ and send the codeword in $C_i$
corresponding to $j$.

(d) Repeat for the next symbol. What is the average message length
of the next symbol conditioned on the previous state $X_n = i$
using this coding scheme? What is the unconditional average
number of bits per source symbol? Relate this to the entropy
rate $H(\mathcal{U})$ of the Markov chain.

5.9 Optimal code lengths that require one bit above entropy. The
source coding theorem shows that the optimal code for a random
variable X has an expected length less than H(X) + 1. Give an
example of a random variable for which the expected length of the
optimal code is close to H(X) + 1 [i.e., for any $\epsilon > 0$, construct a
distribution for which the optimal code has $L > H(X) + 1 - \epsilon$].
5.10 Ternary codes that achieve the entropy bound. A random variable
X takes on m values and has entropy H(X). An instantaneous
ternary code is found for this source, with average length

$$L = \frac{H(X)}{\log_2 3} = H_3(X). \tag{5.145}$$

(a) Show that each symbol of X has a probability of the form $3^{-i}$
for some $i$.

(b) Show that m is odd. 

5.11 Suffix condition. Consider codes that satisfy the suffix condition, 
which says that no codeword is a suffix of any other codeword. 
Show that a suffix condition code is uniquely decodable, and show 
that the minimum average length over all codes satisfying the suffix 
condition is the same as the average length of the Huffman code 
for that random variable. 

5.12 Shannon codes and Huffman codes. Consider a random variable
X that takes on four values with probabilities $(\frac{1}{3}, \frac{1}{3}, \frac{1}{4}, \frac{1}{12})$.

(a) Construct a Huffman code for this random variable.

(b) Show that there exist two different sets of optimal lengths
for the codewords; namely, show that codeword length
assignments (1, 2, 3, 3) and (2, 2, 2, 2) are both optimal.

(c) Conclude that there are optimal codes with codeword lengths
for some symbols that exceed the Shannon code length
$\lceil \log \frac{1}{p(x)} \rceil$.

5.13 Twenty questions. Player A chooses some object in the universe,
and player B attempts to identify the object with a series of yes-no 
questions. Suppose that player B is clever enough to use the code 
achieving the minimal expected length with respect to player A’s 
distribution. We observe that player B requires an average of 38.5 
questions to determine the object. Find a rough lower bound to the 
number of objects in the universe. 

5.14 Huffman code. Find the (a) binary and (b) ternary Huffman codes
for the random variable X with probabilities

$$p = \left( \frac{1}{21}, \frac{2}{21}, \frac{3}{21}, \frac{4}{21}, \frac{5}{21}, \frac{6}{21} \right).$$

(c) Calculate $L = \sum_i p_i l_i$ in each case.
5.15 Huffman codes

(a) Construct a binary Huffman code for the following distribution
on five symbols: $\mathbf{p} = (0.3, 0.3, 0.2, 0.1, 0.1)$. What is the
average length of this code?

(b) Construct a probability distribution $\mathbf{p}'$ on five symbols for
which the code that you constructed in part (a) has an average
length (under $\mathbf{p}'$) equal to its entropy $H(\mathbf{p}')$.

5.16 Huffman codes. Consider a random variable X that takes six
values {A, B, C, D, E, F} with probabilities 0.5, 0.25, 0.1, 0.05, 0.05,
and 0.05, respectively.

(a) Construct a binary Huffman code for this random variable.
What is its average length?

(b) Construct a quaternary Huffman code for this random variable
[i.e., a code over an alphabet of four symbols (call them a, b, c
and d)]. What is the average length of this code?

(c) One way to construct a binary code for the random variable
is to start with a quaternary code and convert the symbols into
binary using the mapping a → 00, b → 01, c → 10, and d → 11.
What is the average length of the binary code for the random
variable above constructed by this process?

(d) For any random variable X, let $L_H$ be the average length of
the binary Huffman code for the random variable, and let $L_{QB}$
be the average length code constructed by first building a
quaternary Huffman code and converting it to binary. Show that

$$L_H \le L_{QB} < L_H + 2. \tag{5.146}$$

(e) The lower bound in the example is tight. Give an example
where the code constructed by converting an optimal quaternary
code is also the optimal binary code.

(f) The upper bound (i.e., $L_{QB} < L_H + 2$) is not tight. In fact, a
better bound is $L_{QB} \le L_H + 1$. Prove this bound, and provide
an example where this bound is tight.


5.17 Data compression. Find an optimal set of binary codeword
lengths $l_1, l_2, \ldots$ (minimizing $\sum_i p_i l_i$) for an instantaneous code
for each of the following probability mass functions:

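(a) $\mathbf{p} = \left( \frac{10}{41}, \frac{9}{41}, \frac{8}{41}, \frac{7}{41}, \frac{7}{41} \right)$

(b) $\mathbf{p} = \left( \frac{9}{10}, \left(\frac{9}{10}\right)\left(\frac{1}{10}\right), \left(\frac{9}{10}\right)\left(\frac{1}{10}\right)^2, \left(\frac{9}{10}\right)\left(\frac{1}{10}\right)^3, \ldots \right)$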
5.18 Classes of codes. Consider the code {0, 01}.

(a) Is it instantaneous? 

(b) Is it uniquely decodable? 

(c) Is it nonsingular? 


5.19 The game of Hi-Lo 

(a) A computer generates a number X according to a known
probability mass function $p(x)$, $x \in \{1, 2, \ldots, 100\}$. The player asks
a question, "Is X = i?" and is told "Yes," "You're too high,"
or "You're too low." He continues for a total of six questions.
If he is right (i.e., he receives the answer "Yes") during this
sequence, he receives a prize of value $v(X)$. How should the
player proceed to maximize his expected winnings?

(b) Part (a) doesn't have much to do with information theory.
Consider the following variation: $X \sim p(x)$, prize = $v(x)$, $p(x)$
known, as before. But arbitrary yes-no questions are asked
sequentially until X is determined. ("Determined" doesn't mean
that a "Yes" answer is received.) Questions cost 1 unit each.
How should the player proceed? What is the expected payoff?

(c) Continuing part (b), what if v(x) is fixed but p(x) can be 
chosen by the computer (and then announced to the player)? 
The computer wishes to minimize the player’s expected return. 
What should p(x) be? What is the expected return to the 
player? 


5.20 Huffman codes with costs. Words such as "Run!", "Help!", and
"Fire!" are short, not because they are used frequently, but perhaps
because time is precious in the situations in which these words are
required. Suppose that X = i with probability $p_i$, $i = 1, 2, \ldots, m$.
Let $l_i$ be the number of binary symbols in the codeword associated
with X = i, and let $c_i$ denote the cost per letter of the codeword
when X = i. Thus, the average cost C of the description of X is
$C = \sum_{i=1}^{m} p_i c_i l_i$.

(a) Minimize C over all $l_1, l_2, \ldots, l_m$ such that $\sum_i 2^{-l_i} \le 1$. Ignore
any implied integer constraints on $l_i$. Exhibit the minimizing
$l_1^*, l_2^*, \ldots, l_m^*$ and the associated minimum value $C^*$.

(b) How would you use the Huffman code procedure to minimize
C over all uniquely decodable codes? Let $C_{\text{Huffman}}$ denote this
minimum.

(c) Can you show that

$$C^* \le C_{\text{Huffman}} \le C^* + \sum_{i=1}^{m} p_i c_i?$$

5.21 Conditions for unique decodability. Prove that a code C is
uniquely decodable if (and only if) the extension

$C^k(x_1, x_2, \ldots, x_k) = C(x_1)C(x_2) \cdots C(x_k)$

is a one-to-one mapping from $\mathcal{X}^k$ to $\mathcal{D}^*$ for every $k \ge 1$. (The “only
if” part is obvious.)

5.22 Average length of an optimal code. Prove that $L(p_1, \ldots, p_m)$,
the average codeword length for an optimal D-ary prefix code for
probabilities $\{p_1, \ldots, p_m\}$, is a continuous function of $p_1, \ldots, p_m$.
This is true even though the optimal code changes discontinuously
as the probabilities vary.

5.23 Unused code sequences. Let C be a variable-length code that
satisfies the Kraft inequality with an equality but does not satisfy 
the prefix condition. 

(a) Prove that some finite sequence of code alphabet symbols is 
not the prefix of any sequence of codewords. 

(b) (Optional) Prove or disprove: C has infinite decoding delay. 

5.24 Optimal codes for uniform distributions. Consider a random
variable with m equiprobable outcomes. The entropy of this
information source is obviously $\log_2 m$ bits.

(a) Describe the optimal instantaneous binary code for this source
and compute the average codeword length $L_m$.

(b) For what values of m does the average codeword length $L_m$
equal the entropy $H = \log_2 m$?

(c) We know that L < H + 1 for any probability distribution. The
redundancy of a variable-length code is defined to be $\rho = L - H$.
For what value(s) of m, where $2^k \le m \le 2^{k+1}$, is the
redundancy of the code maximized? What is the limiting value
of this worst-case redundancy as $m \to \infty$?

5.25 Optimal codeword lengths. Although the codeword lengths of an
optimal variable-length code are complicated functions of the
message probabilities $\{p_1, p_2, \ldots, p_m\}$, it can be said that less probable
symbols are encoded into longer codewords. Suppose that the
message probabilities are given in decreasing order, $p_1 > p_2 \ge \cdots \ge p_m$.

(a) Prove that for any binary Huffman code, if the most probable
message symbol has probability $p_1 > \frac{2}{5}$, that symbol must be
assigned a codeword of length 1.

(b) Prove that for any binary Huffman code, if the most probable
message symbol has probability $p_1 < \frac{1}{3}$, that symbol must be
assigned a codeword of length $\ge 2$.

5.26 Merges. Companies with values $W_1, W_2, \ldots, W_m$ are merged as
follows. The two least valuable companies are merged, thus forming
a list of m - 1 companies. The value of the merge is the
sum of the values of the two merged companies. This continues
until one supercompany remains. Let V equal the sum of
the values of the merges. Thus, V represents the total reported
dollar volume of the merges. For example, if W = (3, 3, 2, 2),
the merges yield (3, 3, 2, 2) → (4, 3, 3) → (6, 4) → (10) and
V = 4 + 6 + 10 = 20.

(a) Argue that V is the minimum volume achievable by sequences
of pairwise merges terminating in one supercompany. (Hint:
Compare to Huffman coding.)

(b) Let $W = \sum W_i$, $\tilde{W}_i = W_i / W$, and show that the minimum
merge volume V satisfies

$W H(\tilde{W}) \le V \le W H(\tilde{W}) + W.$ (5.147)
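
The merging rule in Problem 5.26 can be sketched in a few lines. The following is an illustrative sketch only, not part of the problem: the function name merge_volume is hypothetical, and a binary heap is assumed as a convenient way to find the two least valuable companies at each step.

```python
import heapq

def merge_volume(values):
    """Total reported volume when the two least valuable companies
    are merged repeatedly until one supercompany remains."""
    heap = list(values)
    heapq.heapify(heap)
    volume = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)        # least valuable company
        b = heapq.heappop(heap)        # second least valuable company
        volume += a + b                # value of this merge
        heapq.heappush(heap, a + b)    # merged company rejoins the list
    return volume

print(merge_volume([3, 3, 2, 2]))      # 20, matching 4 + 6 + 10 in the text
```

The structure is the same as the Huffman procedure, with company values playing the role of (unnormalized) probabilities.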

5.27 Sardinas-Patterson test for unique decodability. A code is not
uniquely decodable if and only if there exists a finite sequence of
code symbols which can be resolved into sequences of codewords
in two different ways. That is, a situation such as

[Figure: one string of code symbols parsed in two ways, as
$A_1 A_2 A_3 \cdots A_m$ along the top and as $B_1 B_2 B_3 \cdots B_n$ along the bottom.]

must occur where each $A_i$ and each $B_i$ is a codeword. Note that
$B_1$ must be a prefix of $A_1$ with some resulting “dangling suffix.”
Each dangling suffix must in turn be either a prefix of a codeword
or have another codeword as its prefix, resulting in another
dangling suffix. Finally, the last dangling suffix in the sequence must
also be a codeword. Thus, one can set up a test for unique
decodability (which is essentially the Sardinas-Patterson test [456]) in
the following way: Construct a set S of all possible dangling
suffixes. The code is uniquely decodable if and only if S contains no
codeword.

(a) State the precise rules for building the set S.

(b) Suppose that the codeword lengths are $l_i$, i = 1, 2, ..., m. Find
a good upper bound on the number of elements in the set S.

(c) Determine which of the following codes is uniquely decodable: 

(i) {0, 10, 11}

(ii) {0,01,11} 

(iii) {0,01,10} 

(iv) {0,01} 

(v) {00,01, 10, 11} 

(vi) {110,11,10} 

(vii) {110, 11, 100,00, 10} 


(d) For each uniquely decodable code in part (c), construct, if
possible, an infinite encoded sequence with a known starting point
such that it can be resolved into codewords in two different
ways. (This illustrates that unique decodability does not imply
finite decodability.) Prove that such a sequence cannot arise in
a prefix code.
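
One possible reading of the dangling-suffix construction in Problem 5.27 is sketched below. It is an assumption-laden sketch rather than the reference answer to part (a): the function names are hypothetical, codewords are assumed to be distinct nonempty strings, and the set S is grown until no new suffix appears.

```python
def dangling_suffixes(code):
    """Collect every dangling suffix reachable from the code."""
    code = set(code)
    # Seed: what is left over when one codeword is a proper prefix of another.
    found = {v[len(u):] for u in code for v in code
             if u != v and v.startswith(u)}
    frontier = set(found)
    while frontier:
        new = set()
        for s in frontier:
            for c in code:
                if c.startswith(s) and c != s:   # suffix is a prefix of a codeword
                    new.add(c[len(s):])
                if s.startswith(c) and s != c:   # a codeword is a prefix of the suffix
                    new.add(s[len(c):])
        frontier = new - found                   # only suffixes not seen before
        found |= frontier
    return found

def is_uniquely_decodable(code):
    """The code is uniquely decodable iff no dangling suffix is a codeword."""
    return not (dangling_suffixes(code) & set(code))

print(is_uniquely_decodable({"0", "01", "011"}))   # True
print(is_uniquely_decodable({"1", "10", "01"}))    # False: 101 = 1,01 = 10,1
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever be added.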

5.28 Shannon code. Consider the following method for generating a
code for a random variable X that takes on m values {1, 2, ..., m}
with probabilities $p_1, p_2, \ldots, p_m$. Assume that the probabilities are
ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$. Define

$F_i = \sum_{k=1}^{i-1} p_k,$ (5.148)

the sum of the probabilities of all symbols less than i. Then the
codeword for i is the number $F_i \in [0, 1]$ rounded off to $l_i$ bits,
where $l_i = \lceil \log \frac{1}{p_i} \rceil$.

(a) Show that the code constructed by this process is prefix-free
and that the average length satisfies

$H(X) \le L < H(X) + 1.$ (5.149)

(b) Construct the code for the probability distribution (0.5, 0.25, 
0.125, 0.125). 
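
The construction in Problem 5.28 can be sketched as follows. This is a hedged illustration, not the book's solution: the function name shannon_code is hypothetical, "rounded off to $l_i$ bits" is assumed to mean keeping the first $l_i$ bits of the binary expansion of $F_i$, and exact rational arithmetic is used so that no floating-point rounding enters.

```python
from fractions import Fraction

def shannon_code(probs):
    """probs: probabilities in decreasing order. Returns a list of bit strings."""
    codewords = []
    cumulative = Fraction(0)                 # F_i = sum of p_k for k < i
    for p in map(Fraction, probs):
        length = 0                           # l_i = ceil(log2(1/p_i)), found exactly
        while Fraction(1, 2 ** length) > p:
            length += 1
        bits, frac = "", cumulative
        for _ in range(length):              # first l_i bits of the expansion of F_i
            frac *= 2
            bit = int(frac >= 1)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        cumulative += p
    return codewords

dyadic = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
print(shannon_code(dyadic))                  # ['0', '10', '110', '111']
```

For this dyadic distribution the codeword lengths equal $\log \frac{1}{p_i}$ exactly, so the average length meets the lower bound in (5.149).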
5.29 Optimal codes for dyadic distributions. For a Huffman code tree,
define the probability of a node as the sum of the probabilities of
all the leaves under that node. Let the random variable X be drawn
from a dyadic distribution [i.e., $p(x) = 2^{-i}$, for some i, for all
$x \in \mathcal{X}$]. Now consider a binary Huffman code for this distribution.

(a) Argue that for any node in the tree, the probability of the left
child is equal to the probability of the right child.

(b) Let $X_1, X_2, \ldots, X_n$ be drawn i.i.d. ~ p(x). Using the Huffman
code for p(x), we map $X_1, X_2, \ldots, X_n$ to a sequence
of bits $Y_1, Y_2, \ldots, Y_{k(X_1, X_2, \ldots, X_n)}$. (The length of this sequence
will depend on the outcome $X_1, X_2, \ldots, X_n$.) Use part (a) to
argue that the sequence $Y_1, Y_2, \ldots$ forms a sequence of fair coin
flips [i.e., that $\Pr\{Y_i = 0\} = \Pr\{Y_i = 1\} = \frac{1}{2}$, independent of
$Y_1, Y_2, \ldots, Y_{i-1}$]. Thus, the entropy rate of the coded sequence
is 1 bit per symbol.

(c) Give a heuristic argument why the encoded sequence of bits
for any code that achieves the entropy bound cannot be
compressible and therefore should have an entropy rate of 1 bit per
symbol.

5.30 Relative entropy is cost of miscoding. Let the random variable X
have five possible outcomes {1, 2, 3, 4, 5}. Consider two
distributions p(x) and q(x) on this random variable.

Symbol   p(x)    q(x)    C_1(x)   C_2(x)
1        1/2     1/2     0        0
2        1/4     1/8     10       100
3        1/8     1/8     110      101
4        1/16    1/8     1110     110
5        1/16    1/8     1111     111

(a) Calculate H(p), H(q), D(p||q), and D(q||p).

(b) The last two columns represent codes for the random variable.
Verify that the average length of $C_1$ under p is equal to the
entropy H(p). Thus, $C_1$ is optimal for p. Verify that $C_2$ is
optimal for q.

(c) Now assume that we use code $C_2$ when the distribution is p.
What is the average length of the codewords? By how much
does it exceed the entropy H(p)?

(d) What is the loss if we use code $C_1$ when the distribution is q?
5.31 Nonsingular codes. The discussion in the text focused on
instantaneous codes, with extensions to uniquely decodable codes. Both
these are required in cases when the code is to be used repeatedly
to encode a sequence of outcomes of a random variable. But if
we need to encode only one outcome and we know when we have
reached the end of a codeword, we do not need unique
decodability; the fact that the code is nonsingular would suffice. For
example, if a random variable X takes on three values, a, b, and c,
we could encode them by 0, 1, and 00. Such a code is nonsingular
but not uniquely decodable.

In the following, assume that we have a random variable X which
takes on m values with probabilities $p_1, p_2, \ldots, p_m$ and that the
probabilities are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$.

(a) By viewing the nonsingular binary code as a ternary code with
three symbols, 0, 1, and “STOP,” show that the expected length
of a nonsingular code $L_{1:1}$ for a random variable X satisfies the
following inequality:

$L_{1:1} \ge \frac{H_2(X)}{\log_2 3} - 1,$ (5.150)

where $H_2(X)$ is the entropy of X in bits. Thus, the average
length of a nonsingular code is at least a constant fraction of
the average length of an instantaneous code.

(b) Let $L_{\text{INST}}^*$ be the expected length of the best instantaneous code
and $L_{1:1}^*$ be the expected length of the best nonsingular code
for X. Argue that $L_{1:1}^* \le L_{\text{INST}}^* \le H(X) + 1$.

(c) Give a simple example where the average length of the
nonsingular code is less than the entropy.

(d) The set of codewords available for a nonsingular code is {0, 1,
00, 01, 10, 11, 000, ...}. Since $L_{1:1} = \sum_{i=1}^{m} p_i l_i$, show that this
is minimized if we allot the shortest codewords to the most
probable symbols. Thus, $l_1 = l_2 = 1$, $l_3 = l_4 = l_5 = l_6 = 2$, etc.
Show that in general $l_i = \lceil \log(\frac{i}{2} + 1) \rceil$, and therefore
$L_{1:1}^* = \sum_{i=1}^{m} p_i \lceil \log(\frac{i}{2} + 1) \rceil$.

(e) Part (d) shows that it is easy to find the optimal nonsingular
code for a distribution. However, it is a little more
tricky to deal with the average length of this code. We now
bound this average length. It follows from part (d) that
$L_{1:1}^* \ge \tilde{L} = \sum_{i=1}^{m} p_i \log(\frac{i}{2} + 1)$. Consider the difference

$F(\mathbf{p}) = H(X) - \tilde{L} = -\sum_{i=1}^{m} p_i \log p_i - \sum_{i=1}^{m} p_i \log\left(\frac{i}{2} + 1\right).$ (5.151)

Prove by the method of Lagrange multipliers that the maximum
of $F(\mathbf{p})$ occurs when $p_i = c/(i + 2)$, where $c = 1/(H_{m+2} - H_2)$
and $H_k$ is the sum of the harmonic series:

$H_k = \sum_{i=1}^{k} \frac{1}{i}.$ (5.152)

(This can also be done using the nonnegativity of relative
entropy.)

(f) Complete the arguments for

$H(X) - L_{1:1}^* \le H(X) - \tilde{L}$ (5.153)

$\le \log(2(H_{m+2} - H_2)).$ (5.154)

Now it is well known (see, e.g., Knuth [315]) that $H_k \approx \ln k$
(more precisely, $H_k = \ln k + \gamma + \frac{1}{2k} - \frac{1}{12k^2} + \frac{1}{120k^4} - \epsilon$, where
$0 < \epsilon < 1/(252k^6)$, and $\gamma$ = Euler’s constant = 0.577...).
Using either this or a simple approximation that $H_k \le \ln k + 1$,
which can be proved by integration of $\frac{1}{x}$, it can be shown that
$H(X) - L_{1:1}^* < \log \log m + 2$. Thus, we have

$H(X) - \log \log |\mathcal{X}| - 2 \le L_{1:1}^* \le H(X) + 1.$ (5.155)

A nonsingular code cannot do much better than an instantaneous
code!

5.32 Bad wine. One is given six bottles of wine. It is known that
precisely one bottle has gone bad (tastes terrible). From inspection
of the bottles it is determined that the probability $p_i$ that the ith
bottle is bad is given by $(p_1, p_2, \ldots, p_6) = \left(\frac{8}{23}, \frac{6}{23}, \frac{4}{23}, \frac{2}{23}, \frac{2}{23}, \frac{1}{23}\right)$.
Tasting will determine the bad wine. Suppose that you taste the
wines one at a time. Choose the order of tasting to minimize the
expected number of tastings required to determine the bad bottle.
Remember, if the first five wines pass the test, you don’t have to
taste the last.

(a) What is the expected number of tastings required? 

(b) Which bottle should be tasted first? 

Now you get smart. For the first sample, you mix some of the wines 
in a fresh glass and sample the mixture. You proceed, mixing and 
tasting, stopping when the bad bottle has been determined. 

(a) What is the minimum expected number of tastings required to 
determine the bad wine? 

(b) What mixture should be tasted first? 

5.33 Huffman vs. Shannon. A random variable X takes on three values
with probabilities 0.6, 0.3, and 0.1.

(a) What are the lengths of the binary Huffman codewords for
X? What are the lengths of the binary Shannon codewords
($l(x) = \lceil \log \frac{1}{p(x)} \rceil$) for X?

(b) What is the smallest integer D such that the expected Shannon 
codeword length with a D-ary alphabet equals the expected 
Huffman codeword length with a D-ary alphabet? 

5.34 Huffman algorithm for tree construction. Consider the following
problem: m binary signals $S_1, S_2, \ldots, S_m$ are available at times
$T_1 \le T_2 \le \cdots \le T_m$, and we would like to find their sum
$S_1 \oplus S_2 \oplus \cdots \oplus S_m$ using two-input gates, each gate with one time unit delay,
so that the final result is available as quickly as possible. A simple
greedy algorithm is to combine the earliest two results, forming
the partial result at time $\max(T_1, T_2) + 1$. We now have a new
problem with $S_1 \oplus S_2, S_3, \ldots, S_m$, available at times
$\max(T_1, T_2) + 1, T_3, \ldots, T_m$. We can now sort this list of T’s and apply the same
merging step again, repeating this until we have the final result.

(a) Argue that the foregoing procedure is optimal, in that it
constructs a circuit for which the final result is available as quickly
as possible.

(b) Show that this procedure finds the tree that minimizes

$C(T) = \max_i (T_i + l_i),$ (5.156)

where $T_i$ is the time at which the result allotted to the ith leaf
is available and $l_i$ is the length of the path from the ith leaf to
the root.

(c) Show that

$C(T) \ge \log_2 \left( \sum_i 2^{T_i} \right)$ (5.157)

for any tree T.

(d) Show that there exists a tree such that

$C(T) \le \log_2 \left( \sum_i 2^{T_i} \right) + 1.$ (5.158)

Thus, $\log_2 \left( \sum_i 2^{T_i} \right)$ is the analog of entropy for this problem.
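
The greedy merging step described in Problem 5.34 can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, a nonempty list of arrival times is assumed, and a heap stands in for re-sorting the list after each merge.

```python
import heapq
import math

def earliest_finish_time(arrival_times):
    """Greedy schedule: repeatedly combine the two earliest-available results."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        t1 = heapq.heappop(heap)
        t2 = heapq.heappop(heap)
        heapq.heappush(heap, max(t1, t2) + 1)   # one time unit of gate delay
    return heap[0]

times = [0, 0, 3]
finish = earliest_finish_time(times)
entropy_analog = math.log2(sum(2 ** t for t in times))
print(finish)                                   # 4
print(entropy_analog <= finish <= entropy_analog + 1)
```

For the arrival times (0, 0, 3), the finishing time 4 lies between $\log_2 10 \approx 3.32$ and $\log_2 10 + 1$, consistent with the bounds in parts (c) and (d).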

5.35 Generating random variables. One wishes to generate a random
variable X

$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p. \end{cases}$ (5.159)

You are given fair coin flips $Z_1, Z_2, \ldots$. Let N be the (random)
number of flips needed to generate X. Find a good way to use
$Z_1, Z_2, \ldots$ to generate X. Show that $EN \le 2$.

5.36 Optimal word lengths.

(a) Can l = (1, 2, 2) be the word lengths of a binary Huffman
code? What about (2, 2, 3, 3)?

(b) What word lengths $l = (l_1, l_2, \ldots)$ can arise from binary
Huffman codes?


5.37 Codes. Which of the following codes are 

(a) Uniquely decodable? 

(b) Instantaneous? 


$C_1$ = {00, 01, 0}

$C_2$ = {00, 01, 100, 101, 11}

$C_3$ = {0, 10, 110, 1110, ...}

$C_4$ = {0, 00, 000, 0000}

5.38 Huffman. Find the Huffman D-ary code for $(p_1, p_2, p_3, p_4, p_5, p_6)
= \left(\frac{6}{25}, \frac{6}{25}, \frac{4}{25}, \frac{4}{25}, \frac{3}{25}, \frac{2}{25}\right)$ and the expected word length

(a) For D = 2. 

(b) For D = 4. 
5.39 Entropy of encoded bits. Let $C : \mathcal{X} \to \{0, 1\}^*$ be a nonsingular
but nonuniquely decodable code. Let X have entropy H(X).

(a) Compare H(C(X)) to H(X).

(b) Compare $H(C(X^n))$ to $H(X^n)$.

5.40 Code rate. Let X be a random variable with alphabet {1, 2, 3}
and distribution

$X = \begin{cases} 1 & \text{with probability } \frac{1}{2} \\ 2 & \text{with probability } \frac{1}{4} \\ 3 & \text{with probability } \frac{1}{4}. \end{cases}$

The data compression code for X assigns codewords

$C(x) = \begin{cases} 0 & \text{if } x = 1 \\ 10 & \text{if } x = 2 \\ 11 & \text{if } x = 3. \end{cases}$

Let $X_1, X_2, \ldots$ be independent, identically distributed according
to this distribution and let $Z_1 Z_2 Z_3 \cdots = C(X_1)C(X_2) \cdots$ be the
string of binary symbols resulting from concatenating the
corresponding codewords. For example, 122 becomes 01010.

(a) Find the entropy rate $H(\mathcal{X})$ and the entropy rate $H(\mathcal{Z})$ in bits
per symbol. Note that Z is not compressible further.

(b) Now let the code be

$C(x) = \begin{cases} 00 & \text{if } x = 1 \\ 10 & \text{if } x = 2 \\ 01 & \text{if } x = 3 \end{cases}$

and find the entropy rate $H(\mathcal{Z})$.

(c) Finally, let the code be

$C(x) = \begin{cases} 00 & \text{if } x = 1 \\ 1 & \text{if } x = 2 \\ 01 & \text{if } x = 3 \end{cases}$

and find the entropy rate $H(\mathcal{Z})$.

5.41 Optimal codes. Let $l_1, l_2, \ldots, l_{10}$ be the binary Huffman
codeword lengths for the probabilities $p_1 \ge p_2 \ge \cdots \ge p_{10}$. Suppose
that we get a new distribution by splitting the last probability
mass. What can you say about the optimal binary codeword lengths
$\tilde{l}_1, \tilde{l}_2, \ldots, \tilde{l}_{11}$ for the probabilities $p_1, p_2, \ldots, p_9, \alpha p_{10}, (1 - \alpha)p_{10}$,
where $0 \le \alpha \le 1$?

5.42 Ternary codes. Which of the following codeword lengths can be
the word lengths of a 3-ary Huffman code, and which cannot? 

(a) (1,2,2, 2, 2) 

(b) (2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3) 

5.43 Piecewise Huffman. Suppose the codeword that we use to
describe a random variable X ~ p(x) always starts with a symbol
chosen from the set {A, B, C}, followed by binary digits {0, 1}.
Thus, we have a ternary code for the first symbol and binary
thereafter. Give the optimal uniquely decodable code (minimum
expected number of symbols) for the probability distribution

$p = \left(\frac{16}{69}, \frac{15}{69}, \frac{12}{69}, \frac{10}{69}, \frac{8}{69}, \frac{8}{69}\right).$ (5.160)

5.44 Huffman. Find the word lengths of the optimal binary encoding
of $p = \left(\frac{1}{100}, \frac{1}{100}, \ldots, \frac{1}{100}\right)$.

5.45 Random 20 questions. Let X be uniformly distributed over
{1, 2, ..., m}. Assume that $m = 2^n$. We ask random questions: Is $X \in S_1$?
Is $X \in S_2$? ... until only one integer remains. All $2^m$ subsets S of
{1, 2, ..., m} are equally likely to be asked.

(a) Without loss of generality, suppose that X = 1 is the random
object. What is the probability that object 2 yields the same
answers for k questions as does object 1?

(b) What is the expected number of objects in {2, 3, ..., m} that
have the same answers to the questions as does the correct
object 1?

(c) Suppose that we ask $n + \sqrt{n}$ random questions. What is the
expected number of wrong objects agreeing with the answers?

(d) Use Markov’s inequality $\Pr\{X \ge t\mu\} \le \frac{1}{t}$ to show that the
probability of error (one or more wrong objects remaining) goes
to zero as $n \to \infty$.


HISTORICAL NOTES 


The foundations for the material in this chapter can be found in
Shannon’s original paper [469], in which Shannon stated the source coding
theorem and gave simple examples of codes. He described a simple code
construction procedure (described in Problem 5.28), which he attributed
to Fano. This method is now called the Shannon-Fano code construction
procedure.

The Kraft inequality for uniquely decodable codes was first proved 
by McMillan [385]; the proof given here is due to Karush [306]. The 
Huffman coding procedure was first exhibited and proved to be optimal 
by Huffman [283]. 

In recent years, there has been considerable interest in designing source 
codes that are matched to particular applications, such as magnetic record¬ 
ing. In these cases, the objective is to design codes so that the output 
sequences satisfy certain properties. Some of the results for this problem 
are described by Franaszek [219], Adler et al. [5] and Marcus [370]. 

The arithmetic coding procedure has its roots in the Shannon-Fano
code developed by Elias (unpublished), which was analyzed by Jelinek
[297]. The procedure for the construction of a prefix-free code described
in the text is due to Gilbert and Moore [249]. The extension of the
Shannon-Fano-Elias method to sequences is based on the enumerative
methods in Cover [120] and was described with finite-precision arithmetic
by Pasco [414] and Rissanen [441]. The competitive optimality of
Shannon codes was proved in Cover [125] and extended to Huffman codes by
Feder [203]. Section 5.11 on the generation of discrete distributions from
fair coin flips follows the work of Knuth and Yao [317].


CHAPTER 6 

GAMBLING AND DATA 
COMPRESSION 


At first sight, information theory and gambling seem to be unrelated. 
But as we shall see, there is strong duality between the growth rate of 
investment in a horse race and the entropy rate of the horse race. Indeed, 
the sum of the growth rate and the entropy rate is a constant. In the process 
of proving this, we shall argue that the financial value of side information 
is equal to the mutual information between the horse race and the side 
information. The horse race is a special case of investment in the stock 
market, studied in Chapter 16. 

We also show how to use a pair of identical gamblers to compress a 
sequence of random variables by an amount equal to the growth rate of 
wealth on that sequence. Finally, we use these gambling techniques to 
estimate the entropy rate of English. 

6.1 THE HORSE RACE 

Assume that m horses run in a race. Let the ith horse win with probability
$p_i$. If horse i wins, the payoff is $o_i$ for 1 (i.e., an investment of 1 dollar
on horse i results in $o_i$ dollars if horse i wins and 0 dollars if horse i
loses).

There are two ways of describing odds: a-for-1 and b-to-1. The first
refers to an exchange that takes place before the race: the gambler puts
down 1 dollar before the race and at a-for-1 odds will receive a dollars
after the race if his horse wins, and will receive nothing otherwise. The
second refers to an exchange after the race: at b-to-1 odds, the gambler
will pay 1 dollar after the race if his horse loses and will pick up b dollars
after the race if his horse wins. Thus, a bet at b-to-1 odds is equivalent to
a bet at a-for-1 odds if b = a - 1. For example, fair odds on a coin flip
would be 2-for-1 or 1-to-1, otherwise known as even odds.
We assume that the gambler distributes all of his wealth across the
horses. Let $b_i$ be the fraction of the gambler’s wealth invested in horse i,
where $b_i \ge 0$ and $\sum b_i = 1$. Then if horse i wins the race, the gambler
will receive $o_i$ times the amount of wealth bet on horse i. All the other
bets are lost. Thus, at the end of the race, the gambler will have multiplied
his wealth by a factor $b_i o_i$ if horse i wins, and this will happen with
probability $p_i$. For notational convenience, we use b(i) and $b_i$ interchangeably
throughout this chapter.

The wealth at the end of the race is a random variable, and the gambler 
wishes to “maximize” the value of this random variable. It is tempting to 
bet everything on the horse that has the maximum expected return (i.e.,
the one with the maximum $p_i o_i$). But this is clearly risky, since all the
money could be lost.

Some clarity results from considering repeated gambles on this race.
Now since the gambler can reinvest his money, his wealth is the product
of the gains for each race. Let $S_n$ be the gambler’s wealth after n races.
Then

$S_n = \prod_{i=1}^{n} S(X_i),$ (6.1)

where S(X) = b(X)o(X) is the factor by which the gambler’s wealth is
multiplied when horse X wins.

Definition The wealth relative S(X) = b(X)o(X) is the factor by which
the gambler’s wealth grows if horse X wins the race.

Definition The doubling rate of a horse race is

$W(\mathbf{b}, \mathbf{p}) = E(\log S(X)) = \sum_{k=1}^{m} p_k \log b_k o_k.$ (6.2)

The definition of doubling rate is justified by the following theorem.

Theorem 6.1.1 Let the race outcomes $X_1, X_2, \ldots$ be i.i.d. ~ p(x).
Then the wealth of the gambler using betting strategy b grows
exponentially at rate $W(\mathbf{b}, \mathbf{p})$; that is,

$S_n \doteq 2^{nW(\mathbf{b}, \mathbf{p})}.$ (6.3)
Proof: Functions of independent random variables are also independent,
and hence $\log S(X_1), \log S(X_2), \ldots$ are i.i.d. Then, by the weak law of
large numbers,

$\frac{1}{n} \log S_n = \frac{1}{n} \sum_{i=1}^{n} \log S(X_i) \to E(\log S(X))$ in probability. (6.4)

Thus,

$S_n \doteq 2^{nW(\mathbf{b}, \mathbf{p})}.$ □ (6.5)

Now since the gambler’s wealth grows as $2^{nW(\mathbf{b}, \mathbf{p})}$, we seek to maximize
the exponent W(b, p) over all choices of the portfolio b.
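
As a small numerical illustration of (6.2) and (6.4), the sketch below computes the doubling rate and the per-race exponent $\frac{1}{n} \log S_n$ for one particular sequence of winners. Everything concrete here is an assumption made for illustration: the function names, the three-horse race, the 3-for-1 odds, and a winner sequence chosen so that its empirical frequencies match p exactly.

```python
import math

def doubling_rate(b, p, o):
    """W(b, p) = sum_k p_k log2(b_k o_k), in bits per race.
    Assumes b_k > 0 wherever p_k > 0."""
    return sum(pk * math.log2(bk * ok) for pk, bk, ok in zip(p, b, o) if pk > 0)

def empirical_rate(b, o, winners):
    """(1/n) log2 S_n for a given sequence of winning horses (0-based)."""
    return sum(math.log2(b[x] * o[x]) for x in winners) / len(winners)

p = [0.5, 0.25, 0.25]          # win probabilities (illustrative)
o = [3.0, 3.0, 3.0]            # 3-for-1 odds on every horse (illustrative)
b = [0.5, 0.25, 0.25]          # bets, here proportional to p
winners = [0, 0, 1, 2] * 250   # 1000 races with frequencies equal to p

print(doubling_rate(b, p, o))         # about 0.085 bits per race
print(empirical_rate(b, o, winners))  # the same value, up to rounding
```

For a randomly drawn sequence the two numbers would differ slightly; the weak law of large numbers says only that the gap vanishes in probability as n grows.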

Definition The optimum doubling rate $W^*(\mathbf{p})$ is the maximum doubling
rate over all choices of the portfolio b:

$W^*(\mathbf{p}) = \max_{\mathbf{b}} W(\mathbf{b}, \mathbf{p}) = \max_{\mathbf{b}:\, b_i \ge 0,\ \sum_i b_i = 1} \sum_{i=1}^{m} p_i \log b_i o_i.$ (6.6)

We maximize W(b, p) as a function of b subject to the constraint
$\sum b_i = 1$. Writing the functional with a Lagrange multiplier and changing
the base of the logarithm (which does not affect the maximizing b), we
have

$J(\mathbf{b}) = \sum p_i \ln b_i o_i + \lambda \sum b_i.$ (6.7)

Differentiating this with respect to $b_i$ yields

$\frac{\partial J}{\partial b_i} = \frac{p_i}{b_i} + \lambda, \quad i = 1, 2, \ldots, m.$ (6.8)

Setting the partial derivative equal to 0 for a maximum, we have

$b_i = -\frac{p_i}{\lambda}.$ (6.9)

Substituting this in the constraint $\sum b_i = 1$ yields $\lambda = -1$ and $b_i = p_i$.
Hence, we can conclude that b = p is a stationary point of the function
J(b). To prove that this is actually a maximum is tedious if we take
second derivatives. Instead, we use a method that works for many such
problems: Guess and verify. We verify that proportional gambling b = p
is optimal in the following theorem. Proportional gambling is known as
Kelly gambling [308].

Theorem 6.1.2 (Proportional gambling is log-optimal) The optimum
doubling rate is given by

$W^*(\mathbf{p}) = \sum p_i \log o_i - H(\mathbf{p})$ (6.10)

and is achieved by the proportional gambling scheme $\mathbf{b}^* = \mathbf{p}$.

Proof: We rewrite the function W(b, p) in a form in which the maximum
is obvious:

$W(\mathbf{b}, \mathbf{p}) = \sum p_i \log b_i o_i$ (6.11)

$= \sum p_i \log \left( \frac{b_i}{p_i} p_i o_i \right)$ (6.12)

$= \sum p_i \log o_i - H(\mathbf{p}) - D(\mathbf{p} \| \mathbf{b})$ (6.13)

$\le \sum p_i \log o_i - H(\mathbf{p}),$ (6.14)

with equality iff p = b (i.e., the gambler bets on each horse in proportion
to its probability of winning). □
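
The decomposition in (6.11) to (6.14) is easy to check numerically. The sketch below does so for one arbitrary choice of probabilities, odds, and bets; all three vectors and the helper names are assumptions made purely for illustration.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def relative_entropy(p, q):
    """D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def doubling_rate(b, p, o):
    """W(b, p) = sum_i p_i log2(b_i o_i)."""
    return sum(pi * math.log2(bi * oi) for pi, bi, oi in zip(p, b, o) if pi > 0)

p = [0.5, 0.3, 0.2]      # true win probabilities (illustrative)
o = [2.0, 4.0, 5.0]      # o_i-for-1 odds (illustrative)
b = [0.4, 0.4, 0.2]      # some non-proportional bet

bound = sum(pi * math.log2(oi) for pi, oi in zip(p, o)) - entropy(p)
print(doubling_rate(b, p, o), bound - relative_entropy(p, b))  # equal, as in (6.13)
print(doubling_rate(p, p, o), bound)                           # equal: b = p attains (6.10)
```

The shortfall of any betting scheme b relative to proportional betting is exactly $D(\mathbf{p} \| \mathbf{b})$, which is why no second-derivative argument is needed.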

Example 6.1.1 Consider a case with two horses, where horse 1 wins
with probability $p_1$ and horse 2 wins with probability $p_2$. Assume even
odds (2-for-1 on both horses). Then the optimal bet is proportional
betting (i.e., $b_1 = p_1$, $b_2 = p_2$). The optimal doubling rate is
$W^*(\mathbf{p}) = \sum p_i \log o_i - H(\mathbf{p}) = 1 - H(\mathbf{p})$, and the resulting wealth grows to
infinity at this rate:

$S_n \doteq 2^{n(1 - H(\mathbf{p}))}.$ (6.15)

Thus, we have shown that proportional betting is growth rate optimal
for a sequence of i.i.d. horse races if the gambler can reinvest his wealth
and if there is no alternative of keeping some of the wealth in cash.

We now consider a special case when the odds are fair with respect to
some distribution (i.e., there is no track take and $\sum \frac{1}{o_i} = 1$). In this case,
we write $r_i = \frac{1}{o_i}$, where $r_i$ can be interpreted as a probability mass function
over the horses. (This is the bookie’s estimate of the win probabilities.)
With this definition, we can write the doubling rate as

$W(\mathbf{b}, \mathbf{p}) = \sum p_i \log b_i o_i$ (6.16)

$= \sum p_i \log \left( \frac{b_i}{p_i} \frac{p_i}{r_i} \right)$ (6.17)

$= D(\mathbf{p} \| \mathbf{r}) - D(\mathbf{p} \| \mathbf{b}).$ (6.18)


This equation gives another interpretation for the relative entropy
distance: The doubling rate is the difference between the distance of the
bookie’s estimate from the true distribution and the distance of the
gambler’s estimate from the true distribution. Hence, the gambler can make
money only if his estimate (as expressed by b) is better than the bookie’s.
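
To make the comparison between gambler and bookie in (6.18) concrete, the following sketch evaluates the doubling rate at fair odds for three hypothetical gamblers. The distributions p and r, the three betting vectors, and the function names are all illustrative assumptions.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def fair_odds_doubling_rate(b, p, r):
    """W(b, p) when the odds are fair with respect to r, i.e. o_i = 1 / r_i."""
    return sum(pi * math.log2(bi / ri) for pi, bi, ri in zip(p, b, r) if pi > 0)

p = [0.5, 0.3, 0.2]       # true win probabilities (illustrative)
r = [0.25, 0.25, 0.5]     # bookie's estimate, so the odds are 4, 4, and 2 for 1

gamblers = {
    "bets the truth": p,
    "copies the bookie": r,
    "worse than the bookie": [0.1, 0.1, 0.8],
}
for name, b in gamblers.items():
    w = fair_odds_doubling_rate(b, p, r)
    check = relative_entropy(p, r) - relative_entropy(p, b)
    print(name, round(w, 4), round(check, 4))   # the two columns agree
```

The gambler who copies the bookie neither gains nor loses in the exponent, and the gambler whose estimate is further from p than the bookie's has a negative doubling rate.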

An even more special case is when the odds are m-for-1 on each horse.
In this case, the odds are fair with respect to the uniform distribution and
the optimum doubling rate is

$W^*(\mathbf{p}) = D\left(\mathbf{p} \,\Big\|\, \frac{1}{m}\right) = \log m - H(\mathbf{p}).$ (6.19)

In this case we can clearly see the duality between data compression and
the doubling rate.

Theorem 6.1.3 (Conservation theorem) For uniform fair odds,

$W^*(\mathbf{p}) + H(\mathbf{p}) = \log m.$ (6.20)

Thus, the sum of the doubling rate and the entropy rate is a constant.

Every bit of entropy decrease doubles the gambler’s wealth. Low entropy
races are the most profitable.

In the analysis above, we assumed that the gambler was fully invested.
In general, we should allow the gambler the option of retaining some of
his wealth as cash. Let b(0) be the proportion of wealth held out as cash,
and b(1), b(2), ..., b(m) be the proportions bet on the various horses.
Then at the end of a race, the ratio of final wealth to initial wealth (the
wealth relative) is

$S(X) = b(0) + b(X)o(X).$ (6.21)

Now the optimum strategy may depend on the odds and will not necessar¬ 
ily have the simple form of proportional gambling. We distinguish three 
subcases: 
1. Fair odds with respect to some distribution: $\sum \frac{1}{o_i} = 1$. For fair odds,
the option of withholding cash does not change the analysis. This is
because we can get the effect of withholding cash by betting $b_i = \frac{1}{o_i}$
on the ith horse, i = 1, 2, ..., m. Then S(X) = 1 irrespective of
which horse wins. Thus, whatever money the gambler keeps aside
as cash can equally well be distributed over the horses, and the
assumption that the gambler must invest all his money does not
change the analysis. Proportional betting is optimal.

2. Superfair odds: $\sum \frac{1}{o_i} < 1$. In this case, the odds are even better than
fair odds, so one would always want to put all one’s wealth into the
race rather than leave it as cash. In this race, too, the optimum
strategy is proportional betting. However, it is possible to choose
b so as to form a Dutch book by choosing $b_i = c \frac{1}{o_i}$, where
$c = 1 / \sum \frac{1}{o_i}$, to get $o_i b_i = c$, irrespective of which horse wins. With
this allotment, one has wealth $S(X) = 1 / \sum \frac{1}{o_i} > 1$ with probability
1 (i.e., no risk). Needless to say, one seldom finds such odds in
real life. Incidentally, a Dutch book, although risk-free, does not
optimize the doubling rate.

3. Subfair odds: $\sum \frac{1}{o_i} > 1$. This is more representative of real life. The
organizers of the race track take a cut of all the bets. In this case it
is optimal to bet only some of the money and leave the rest aside
as cash. Proportional gambling is no longer log-optimal. A parametric
form for the optimal strategy can be found using Kuhn-Tucker
conditions (Problem 6.2); it has a simple “water-filling”
interpretation.
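
The Dutch book of the superfair case can be written out directly. The sketch below is illustrative only: the odds, the win probabilities, and the function name dutch_book are assumptions, and the odds were picked so that $\sum \frac{1}{o_i} = \frac{3}{4} < 1$.

```python
import math

def dutch_book(odds):
    """Bets b_i = c / o_i with c = 1 / sum(1/o_i), so that o_i b_i = c for all i."""
    c = 1.0 / sum(1.0 / o for o in odds)
    return [c / o for o in odds], c

odds = [3.0, 4.0, 6.0]                 # 1/3 + 1/4 + 1/6 = 3/4, superfair
bets, c = dutch_book(odds)
print(sum(bets))                       # 1 (up to rounding): the gambler is fully invested
print([o * b for o, b in zip(odds, bets)])   # each entry is 4/3, whichever horse wins

# Compare with proportional betting under assumed win probabilities p.
p = [0.5, 0.3, 0.2]
w_proportional = sum(pi * math.log2(pi * oi) for pi, oi in zip(p, odds))
print(math.log2(c), w_proportional)    # about 0.415 versus about 0.424
```

The risk-free exponent $\log c$ falls short of the proportional-betting doubling rate here, in line with the remark that a Dutch book does not optimize the doubling rate.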

6.2 GAMBLING AND SIDE INFORMATION 

Suppose the gambler has some information that is relevant to the outcome 
of the gamble. For example, the gambler may have some information 
about the performance of the horses in previous races. What is the value 
of this side information? 

One definition of the financial value of such information is the increase 
in wealth that results from that information. In the setting described in 
Section 6.1 the measure of the value of information is the increase in the 
doubling rate due to that information. We will now derive a connection 
between mutual information and the increase in the doubling rate. 

To formalize the notion, let horse $X \in \{1, 2, \ldots, m\}$ win the race with
probability p(x) and pay odds of o(x) for 1. Let (X, Y) have joint
probability mass function p(x, y). Let $b(x|y) \ge 0$, $\sum_x b(x|y) = 1$ be an
arbitrary conditional betting strategy depending on the side information
Y, where b(x|y) is the proportion of wealth bet on horse x when y is
observed. As before, let $b(x) \ge 0$, $\sum b(x) = 1$ denote the unconditional
betting scheme.

Let the unconditional and the conditional doubling rates be

$W^*(X) = \max_{b(x)} \sum_x p(x) \log b(x)o(x),$ (6.22)

$W^*(X|Y) = \max_{b(x|y)} \sum_{x,y} p(x, y) \log b(x|y)o(x)$ (6.23)

and let

$\Delta W = W^*(X|Y) - W^*(X).$ (6.24)

We observe that for $(X_i, Y_i)$ i.i.d. horse races, wealth grows like $2^{nW^*(X|Y)}$
with side information and like $2^{nW^*(X)}$ without side information.

Theorem 6.2.1 The increase $\Delta W$ in doubling rate due to side
information Y for a horse race X is

$\Delta W = I(X; Y).$ (6.25)

Proof: The maximum value of $W^*(X|Y)$ with
side information Y is achieved by conditionally proportional gambling
[i.e., $b^*(x|y) = p(x|y)$]. Thus,

$W^*(X|Y) = \max_{b(x|y)} E[\log S] = \max_{b(x|y)} \sum p(x, y) \log o(x)b(x|y)$ (6.26)

$= \sum p(x, y) \log o(x)p(x|y)$ (6.27)

$= \sum p(x) \log o(x) - H(X|Y).$ (6.28)

Without side information, the optimal doubling rate is

$W^*(X) = \sum p(x) \log o(x) - H(X).$ (6.29)

Thus, the increase in doubling rate due to the presence of side information
Y is

$\Delta W = W^*(X|Y) - W^*(X) = H(X) - H(X|Y) = I(X; Y).$ □ (6.30)
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Hence, the increase in doubling rate is equal to the mutual informa¬ 
tion between the side information and the horse race. Not surprisingly, 
independent side information does not increase the doubling rate. 

This relationship can also be extended to the general stock market
(Chapter 16). In this case, however, one can only show the inequality
$\Delta W \le I$, with equality if and only if the market is a horse race.
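As a numerical check of Theorem 6.2.1, the following minimal sketch computes both doubling rates and the mutual information for a small race. The joint distribution `JOINT`, the odds `ODDS`, and the function name are hypothetical values chosen only for illustration; they do not come from the text.

```python
import math

# Hypothetical joint pmf p(x, y): rows are 3 horses, columns are a binary tip Y.
JOINT = [
    [0.30, 0.10],
    [0.05, 0.25],
    [0.15, 0.15],
]
ODDS = [3.0, 3.0, 3.0]  # assumed 3-for-1 odds on every horse


def side_information_gain(joint, odds):
    """Return (W*(X), W*(X|Y), I(X;Y)) in bits for a joint pmf given as rows of x."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # Proportional betting b(x) = p(x) attains W*(X).
    w_plain = sum(p * math.log2(o * p) for p, o in zip(p_x, odds) if p > 0)
    w_side = 0.0
    mutual_info = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                # Conditionally proportional betting b(x|y) = p(x|y) attains W*(X|Y).
                w_side += pxy * math.log2(odds[x] * pxy / p_y[y])
                mutual_info += pxy * math.log2(pxy / (p_x[x] * p_y[y]))
    return w_plain, w_side, mutual_info


w_plain, w_side, mutual_info = side_information_gain(JOINT, ODDS)
print(f"W*(X)   = {w_plain:.4f} bits")
print(f"W*(X|Y) = {w_side:.4f} bits")
print(f"gain    = {w_side - w_plain:.4f} bits, I(X;Y) = {mutual_info:.4f} bits")
```

The gain and the mutual information agree up to floating-point error, and the odds cancel out of the difference, as the proof suggests.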


6.3 DEPENDENT HORSE RACES AND ENTROPY RATE 

The most common example of side information for a horse race is the 
past performance of the horses. If the horse races are independent, this 
information will be useless. If we assume that there is dependence among 
the races, we can calculate the effective doubling rate if we are allowed 
to use the results of previous races to determine the strategy for the next 
race. 

Suppose that the sequence $\{X_k\}$ of horse race outcomes forms a stochastic
process. Let the strategy for each race depend on the results of previous
races. In this case, the optimal doubling rate for uniform fair odds is

\[
\begin{aligned}
W^*(X_k | X_{k-1}, X_{k-2}, \ldots, X_1)
&= E\left[ \max_{b(\cdot | X_{k-1}, X_{k-2}, \ldots, X_1)} E[\log S(X_k) | X_{k-1}, X_{k-2}, \ldots, X_1] \right] \\
&= \log m - H(X_k | X_{k-1}, X_{k-2}, \ldots, X_1),
\end{aligned}
\tag{6.31}
\]

which is achieved by $b^*(x_k | x_{k-1}, \ldots, x_1) = p(x_k | x_{k-1}, \ldots, x_1)$.

At the end of $n$ races, the gambler’s wealth is

\[ S_n = \prod_{i=1}^{n} S(X_i), \tag{6.32} \]

and the exponent in the growth rate (assuming $m$ for 1 odds) is

\[ \frac{1}{n} E \log S_n = \frac{1}{n} \sum E \log S(X_i) \tag{6.33} \]

\[ = \frac{1}{n} \sum \left( \log m - H(X_i | X_{i-1}, X_{i-2}, \ldots, X_1) \right) \tag{6.34} \]

\[ = \log m - \frac{H(X_1, X_2, \ldots, X_n)}{n}. \tag{6.35} \]

The quantity $\frac{1}{n} H(X_1, X_2, \ldots, X_n)$ is the average entropy per race. For
a stationary process with entropy rate $H(\mathcal{X})$, the limit in (6.35) yields

\[ \lim_{n \to \infty} \frac{1}{n} E \log S_n + H(\mathcal{X}) = \log m. \tag{6.36} \]

Again, we have the result that the entropy rate plus the doubling rate is a 
constant. 

The expectation in (6.36) can be removed if the process is ergodic. It
will be shown in Chapter 16 that for an ergodic sequence of horse races,

\[ S_n \doteq 2^{nW} \quad \text{with probability 1}, \tag{6.37} \]

where $W = \log m - H(\mathcal{X})$ and

\[ H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n). \tag{6.38} \]
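The relation W = log m - H for dependent races can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two-horse Markov chain `P`, the uniform m-for-1 odds, and the function names are hypothetical and not taken from the text.

```python
import math
import random

# Hypothetical two-horse race whose winner follows a Markov chain:
# P[i][j] = Pr(next winner = j | last winner = i).
P = [[0.9, 0.1],
     [0.2, 0.8]]
M = 2  # number of horses; odds are assumed to be m-for-1


def entropy_rate(P):
    """Entropy rate (bits per race) of a stationary two-state Markov chain."""
    pi0 = P[1][0] / (P[0][1] + P[1][0])  # stationary probability of state 0
    stationary = [pi0, 1.0 - pi0]
    return -sum(stationary[i] * P[i][j] * math.log2(P[i][j])
                for i in range(2) for j in range(2) if P[i][j] > 0)


def simulated_growth(P, n, seed=0):
    """Average log2 wealth gain per race when betting b*(x | last) = P[last][x]."""
    rng = random.Random(seed)
    last, log_wealth = 0, 0.0
    for _ in range(n):
        winner = 0 if rng.random() < P[last][0] else 1
        log_wealth += math.log2(M * P[last][winner])
        last = winner
    return log_wealth / n


predicted = math.log2(M) - entropy_rate(P)
print(f"log m - H   = {predicted:.4f} bits per race")
print(f"simulated W = {simulated_growth(P, 50_000):.4f} bits per race")
```

For a long run the simulated exponent should sit close to the predicted value, which is the ergodic statement (6.37) in miniature.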

Example 6.3.1 (Red and black) In this example, cards replace horses
and the outcomes become more predictable as time goes on. Consider the
case of betting on the color of the next card in a deck of 26 red and 26
black cards. Bets are placed on whether the next card will be red or black,
as we go through the deck. We also assume that the game pays 2-for-1;
that is, the gambler gets back twice what he bets on the right color. These
are fair odds if red and black are equally probable.

We consider two alternative betting schemes: 

1. If we bet sequentially, we can calculate the conditional probability
of the next card and bet proportionally. Thus, we should bet $(\frac{1}{2}, \frac{1}{2})$
on (red, black) for the first card, $(\frac{26}{51}, \frac{25}{51})$ for the second card if the
first card is black, and so on.

2. Alternatively, we can bet on the entire sequence of 52 cards at once.
There are $\binom{52}{26}$ possible sequences of 26 red and 26 black cards, all
of them equally likely. Thus, proportional betting implies that we
put $1/\binom{52}{26}$ of our money on each of these sequences and let each
bet “ride.”

We will argue that these procedures are equivalent. For example, half
the sequences of 52 cards start with red, and so the proportion of money
bet on sequences that start with red in scheme 2 is also one-half, agreeing
with the proportion used in the first scheme. In general, we can verify that
betting $1/\binom{52}{26}$ of the money on each of the possible outcomes will at each
stage give bets that are proportional to the probability of red and black
at that stage. Since we bet $1/\binom{52}{26}$ of the wealth on each possible output
sequence, and a bet on a sequence increases wealth by a factor of $2^{52}$ on
the sequence observed and 0 on all the others, the resulting wealth is

\[ S^*_{52} = \frac{2^{52}}{\binom{52}{26}} = 9.08. \tag{6.39} \]

Rather interestingly, the return does not depend on the actual sequence. 
This is like the AEP in that the return is the same for all sequences. All 
sequences are typical in this sense. 
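The equivalence of the two betting schemes can be checked directly. The sketch below plays scheme 1 (sequential proportional betting at 2-for-1) on a shuffled deck using exact rational arithmetic and compares the final wealth with the closed form of scheme 2. The function name and the "R"/"B" card encoding are illustrative choices, not part of the text.

```python
import random
from fractions import Fraction
from math import comb


def sequential_wealth(deck):
    """Final wealth from betting proportionally on each card's color at 2-for-1."""
    red, black = deck.count("R"), deck.count("B")
    wealth = Fraction(1)
    for card in deck:
        remaining = red + black
        # Fraction of current wealth that was placed on the color that came up.
        on_winner = Fraction(red if card == "R" else black, remaining)
        wealth *= 2 * on_winner
        if card == "R":
            red -= 1
        else:
            black -= 1
    return wealth


deck = ["R"] * 26 + ["B"] * 26
random.shuffle(deck)

closed_form = Fraction(2 ** 52, comb(52, 26))
print(float(sequential_wealth(deck)), float(closed_form))
```

Whatever the shuffle, the two numbers coincide, which is the sense in which every sequence is typical here.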


6.4 THE ENTROPY OF ENGLISH 

An important example of an information source is English text. It is 
not immediately obvious whether English is a stationary ergodic process. 
Probably not! Nonetheless, we will be interested in the entropy rate of 
English. We discuss various stochastic approximations to English. As we 
increase the complexity of the model, we can generate text that looks like 
English. The stochastic models can be used to compress English text. The 
better the stochastic approximation, the better the compression. 

For the purposes of discussion, we assume that the alphabet of English 
consists of 26 letters and the space symbol. We therefore ignore punctuation
and the difference between upper- and lowercase letters. We construct
models for English using empirical distributions collected from samples 
of text. The frequency of letters in English is far from uniform. The most 
common letter, E, has a frequency of about 13%, and the least common 
letters, Q and Z, occur with a frequency of about 0.1%. The letter E is 
so common that it is rare to find a sentence of any length that does not 
contain the letter. [A surprising exception to this is the 267-page novel,
Gadsby, by Ernest Vincent Wright (Lightyear Press, Boston, 1997; original
publication in 1939), in which the author deliberately makes no use
of the letter E.]

The frequency of pairs of letters is also far from uniform. For example,
the letter Q is always followed by a U. The most frequent pair is TH,
which occurs normally with a frequency of about 3.7%. We can use
the frequency of the pairs to estimate the probability that a letter follows
any other letter. Proceeding this way, we can also estimate higher-order
conditional probabilities and build more complex models for the
language. However, we soon run out of data. For example, to build
a third-order Markov approximation, we must estimate the values of
$p(x_i | x_{i-1}, x_{i-2}, x_{i-3})$. There are $27^4 = 531{,}441$ entries in this table, and
we would need to process millions of letters to make accurate estimates
of these probabilities.

The conditional probability estimates can be used to generate random 
samples of letters drawn according to these distributions (using a random 
number generator). But there is a simpler method to simulate randomness 
using a sample of text (a book, say). For example, to construct the second- 
order model, open the book at random and choose a letter at random on 
the page. This will be the first letter. For the next letter, again open the 
book at random and starting at a random point, read until the first letter is 
encountered again. Then take the letter after that as the second letter. We 
repeat this process by opening to another page, searching for the second 
letter, and taking the letter after that as the third letter. Proceeding this 
way, we can generate text that simulates the second-order statistics of the 
English text. 
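The book-sampling procedure just described can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function name is hypothetical, the "book" is any string, and wrapping around from the end of the book to its beginning is an assumption added so that the search always terminates.

```python
import random


def second_order_sample(book, length, rng):
    """Generate text whose letter pairs are drawn from the pairs found in `book`."""
    n = len(book)
    out = [book[rng.randrange(n)]]  # first letter: a random position in the book
    while len(out) < length:
        start = rng.randrange(n)    # open the book at a random point
        for offset in range(n):
            i = (start + offset) % n
            if book[i] == out[-1]:  # read on until the last letter is met again
                out.append(book[(i + 1) % n])  # take the letter that follows it
                break
    return "".join(out)


rng = random.Random(0)
book = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG AND THE CAT SAT ON THE MAT "
print(second_order_sample(book, 60, rng))
```

With a real book in place of the toy string, the output resembles the second-order approximation shown among the examples that follow.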

Here are some examples of Markov approximations to English from 
Shannon’s original paper [472]: 

1. Zero-order approximation. (The symbols are independent and equi- 
probable.) 

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ 
FFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD 

2. First-order approximation. (The symbols are independent. The frequency
of letters matches English text.)

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYATH EEI 
ALHENHTTPA OOBTTVA NAH BRL 

3. Second-order approximation. (The frequency of pairs of letters 
matches English text.) 

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY 
ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO 
TIZIN ANDY TOBE SEACE CTISBE 

4. Third-order approximation. (The frequency of triplets of letters
matches English text.)

IN NO IST LAT WHEY CRATICT FROURE BERS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE

5. Fourth-order approximation. (The frequency of quadruplets of letters
matches English text. Each letter depends on the previous three
letters. This sentence is from Lucky’s book, Silicon Dreams [366].)

THE GENERATED JOB PROVIDUAL BETTER TRANDTHE DISPLAYED 
CODE, ABOVERY UPONDULTS WELL THE CODERSTIN THESTICAL 
IT DO HOCK BOTHE MERG. (INSTATES CONS ERATION. NEVER 
ANY OF PUBLE AND TO THEORY. EVENTIAL CALLEGAND TO ELAST 
BENERATED IN WITH PIES AS IS WITH THE) 

Instead of continuing with the letter models, we jump to word 
models. 

6. First-order word model. (The words are chosen independently but 
with frequencies as in English.) 

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN 
DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO 
EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE 
THESE. 

7. Second-order word model. (The word transition probabilities match
English text.)

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER 
THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER 
METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD 
THE PROBLEM FOR AN UNEXPECTED 


The approximations get closer and closer to resembling English. For 
example, long phrases of the last approximation could easily have occurred 
in a real English sentence. It appears that we could get a very good approximation
by using a more complex model. These approximations could be
used to estimate the entropy of English. For example, the entropy of the 
zeroth-order model is log 27 = 4.76 bits per letter. As we increase the 
complexity of the model, we capture more of the structure of English, 
and the conditional uncertainty of the next letter is reduced. The first- 
order model gives an estimate of the entropy of 4.03 bits per letter, while 
the fourth-order model gives an estimate of 2.8 bits per letter. But even 
the fourth-order model does not capture all the structure of English. In 
Section 6.6 we describe alternative methods for estimating the entropy of 
English. 
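The first two entropy estimates are easy to reproduce in outline. The sketch below computes the zeroth-order value log 27 and an empirical first-order estimate from letter frequencies. The function names are illustrative, and the short sample string is a stand-in: reaching a figure near 4.03 bits would require frequencies gathered from a large body of English text.

```python
import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters plus the space symbol


def zeroth_order_entropy():
    """Entropy in bits per letter when all 27 symbols are equiprobable."""
    return math.log2(len(ALPHABET))


def first_order_entropy(text):
    """Entropy in bits per letter of the empirical single-letter distribution."""
    counts = Counter(ch for ch in text.upper() if ch in ALPHABET)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
print(f"zeroth order: {zeroth_order_entropy():.3f} bits per letter")
print(f"first order (toy sample): {first_order_entropy(sample):.3f} bits per letter")
```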

The distribution of English is useful in decoding encrypted English text.
For example, a simple substitution cipher (where each letter is replaced
by some other letter) can be solved by looking for the most frequent letter
and guessing that it is the substitute for E, and so on. The redundancy in
English can be used to fill in some of the missing letters after the other
letters are decrypted: for example,

TH_R_ _S _NLY _N_ W_Y T_ F_LL _N TH_ V_W_LS _N TH_S S_NT_NC_.

Some of the inspiration for Shannon’s original work on information 
theory came out of his work in cryptography during World War II. The 
mathematical theory of cryptography and its relationship to the entropy 
of language is developed in Shannon [481]. 

Stochastic models of language also play a key role in some speech 
recognition systems. A commonly used model is the trigram (second-order 
Markov) word model, which estimates the probability of the next word 
given the preceding two words. The information from the speech signal 
is combined with the model to produce an estimate of the most likely 
word that could have produced the observed speech. Random models do 
surprisingly well in speech recognition, even when they do not explicitly 
incorporate the complex rules of grammar that govern natural languages 
such as English. 

We can apply the techniques of this section to estimate the entropy rate 
of other information sources, such as speech and images. A fascinating 
nontechnical introduction to these issues may be found in the book by 
Lucky [366]. 

6.5 DATA COMPRESSION AND GAMBLING 


We now show a direct connection between gambling and data compression,
by showing that a good gambler is also a good data compressor. Any
sequence on which a gambler makes a large amount of money is also a 
sequence that can be compressed by a large factor. The idea of using 
the gambler as a data compressor is based on the fact that the gambler’s 
bets can be considered to be his estimate of the probability distribution 
of the data. A good gambler will make a good estimate of the probability 
distribution. We can use this estimate of the distribution to do arithmetic 
coding (Section 13.3). This is the essential idea of the scheme described 
below. 

We assume that the gambler has a mechanically identical twin, who 
will be used for the data decompression. The identical twin will place the 
same bets on possible sequences of outcomes as the original gambler (and 
will therefore make the same amount of money). The cumulative amount
of money that the gambler would have made on all sequences that are
lexicographically less than the given sequence will be used as a code 
for the sequence. The decoder will use the identical twin to gamble on 
all sequences, and look for the sequence for which the same cumulative 
amount of money is made. This sequence will be chosen as the decoded 
sequence. 

Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables that we wish
to compress. Without loss of generality, we will assume that the random
variables are binary. Gambling on this sequence will be defined by a
sequence of bets

\[ b(x_{k+1} | x_1, x_2, \ldots, x_k) \ge 0, \qquad \sum_{x_{k+1}} b(x_{k+1} | x_1, x_2, \ldots, x_k) = 1, \tag{6.40} \]

where $b(x_{k+1} | x_1, x_2, \ldots, x_k)$ is the proportion of money bet at time $k$ on
the event that $X_{k+1} = x_{k+1}$ given the observed past $x_1, x_2, \ldots, x_k$. Bets
are paid at uniform odds (2-for-1). Thus, the wealth $S_n$ at the end of the
sequence is given by

\[ S_n = 2^n \prod_{k=1}^{n} b(x_k | x_1, \ldots, x_{k-1}) \tag{6.41} \]

\[ = 2^n b(x_1, x_2, \ldots, x_n), \tag{6.42} \]

where

\[ b(x_1, x_2, \ldots, x_n) = \prod_{k=1}^{n} b(x_k | x_{k-1}, \ldots, x_1). \tag{6.43} \]

So sequential gambling can also be considered as an assignment of probabilities
(or bets) $b(x_1, x_2, \ldots, x_n) \ge 0$, $\sum_{x_1, \ldots, x_n} b(x_1, \ldots, x_n) = 1$, on the
$2^n$ possible sequences.

This gambling elicits both an estimate of the true probability of the text
sequence ($\hat{p}(x_1, \ldots, x_n) = S_n / 2^n$) as well as an estimate of the entropy
[$\hat{H} = -\frac{1}{n} \log \hat{p}$] of the text from which the sequence was drawn. We now
wish to show that high values of wealth $S_n$ lead to high data compression.
Specifically, we argue that if the text in question results in wealth $S_n$,
then $\log S_n$ bits can be saved in a naturally associated deterministic data
compression scheme. We further assert that if the gambling is log optimal,
the data compression achieves the Shannon limit $H$.

Consider the following data compression algorithm that maps the
text $x = x_1 x_2 \cdots x_n \in \{0, 1\}^n$ into a code sequence $c_1 c_2 \cdots c_k$, $c_i \in \{0, 1\}$.
Both the compressor and the decompressor know $n$. Let
the $2^n$ text sequences be arranged in lexicographical order: for
example, 0100101 < 0101101. The encoder observes the sequence
$x^n = (x_1, x_2, \ldots, x_n)$. He then calculates what his wealth $S_n(x'(n))$
would have been on all sequences $x'(n) \le x(n)$ and calculates
$F(x(n)) = \sum_{x'(n) \le x(n)} 2^{-n} S_n(x'(n))$. Clearly, $F(x(n)) \in [0, 1]$. Let
$k = \lceil n - \log S_n(x(n)) \rceil$. Now express $F(x(n))$ as a binary decimal to $k$-place
accuracy: $\lfloor F(x(n)) \rfloor = .c_1 c_2 \cdots c_k$. The sequence $c(k) = (c_1, c_2, \ldots, c_k)$
is transmitted to the decoder.

The decoder twin can calculate the precise value $S(x'(n))$ associated
with each of the $2^n$ sequences $x'(n)$. He thus knows the cumulative sum
of $2^{-n} S(x'(n))$ up through any sequence $x(n)$. He tediously calculates
this sum until it first exceeds $.c(k)$. The first sequence $x(n)$ such that
the cumulative sum falls in the interval $[.c_1 \cdots c_k, \; .c_1 \cdots c_k + (1/2)^k]$ is
defined uniquely, and the size of $S(x(n))/2^n$ guarantees that this sequence
will be precisely the encoded $x(n)$.

Thus, the twin uniquely recovers $x(n)$. The number of bits required
is $k = \lceil n - \log S(x(n)) \rceil$. The number of bits saved is $n - k =
\lfloor \log S(x(n)) \rfloor$. For proportional gambling, $S(x(n)) = 2^n p(x(n))$. Thus,
the expected number of bits is $Ek = \sum p(x(n)) \lceil -\log p(x(n)) \rceil \le
H(X_1, \ldots, X_n) + 1$.
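The encoder and its identical twin can be sketched for short binary sequences. Several choices here are assumptions made for illustration and are not from the text: the betting rule `bet` is a simple add-one estimate of the next bit, all function names are hypothetical, and the sketch uses a slight variant of the scheme in which the cumulative bet on sequences strictly less than x is rounded up to k places (instead of truncating the sum over sequences up to and including x), so that the lexicographically last sequence needs no special case. Exact rational arithmetic avoids rounding problems.

```python
from fractions import Fraction
from itertools import product


def bet(bit, past):
    """Hypothetical betting rule: add-one estimate of the next bit (an assumption)."""
    p_one = Fraction(sum(past) + 1, len(past) + 2)
    return p_one if bit == 1 else 1 - p_one


def sequence_bet(x):
    """b(x_1, ..., x_n): product of the sequential bets, so that S_n = 2^n b(x)."""
    b = Fraction(1)
    for k, bit in enumerate(x):
        b *= bet(bit, x[:k])
    return b


def code_length(x):
    """k = ceil(n - log2 S_n(x)) = ceil(-log2 b(x)), computed exactly."""
    b, k = sequence_bet(x), 0
    while Fraction(1, 2 ** k) > b:
        k += 1
    return k


def encode(x):
    """Return the k-bit code string for the binary tuple x."""
    x = tuple(x)
    # Cumulative bet on all sequences lexicographically smaller than x.
    below = sum((sequence_bet(y) for y in product((0, 1), repeat=len(x)) if y < x),
                Fraction(0))
    k = code_length(x)
    c = -((-below.numerator * 2 ** k) // below.denominator)  # ceil(below * 2^k)
    return format(c, "b").zfill(k) if k else ""


def decode(code, n):
    """The twin replays the same bets and finds the sequence owning the code point."""
    point = Fraction(int(code, 2), 2 ** len(code)) if code else Fraction(0)
    total = Fraction(0)
    for y in product((0, 1), repeat=n):  # lexicographic order
        total += sequence_bet(y)
        if total > point:
            return y
    raise ValueError("code does not correspond to any sequence")


x = (0, 0, 0, 0, 0, 0, 1, 0)
code = encode(x)
print(code, len(code), decode(code, len(x)) == x)
```

A sequence on which the bettor does well (large S_n) receives a short code, and the twin recovers it because the code point falls inside the interval of width b(x) belonging to x.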

We see that if the betting operation is deterministic and is known
both to the encoder and the decoder, the number of bits necessary to
encode $x_1, \ldots, x_n$ is less than $n - \log S_n + 1$. Moreover, if $p(x)$ is known,
and if proportional gambling is used, the description length expected is
$E(n - \log S_n) \le H(X_1, \ldots, X_n) + 1$. Thus, the gambling results correspond
precisely to the data compression that would have been achieved
by the given human encoder-decoder identical twin pair.

The data compression scheme using a gambler is similar to the idea
of arithmetic coding (Section 13.3) using a distribution $b(x_1, x_2, \ldots, x_n)$
rather than the true distribution. The procedure above brings out the duality
between gambling and data compression. Both involve estimation of the 
true distribution. The better the estimate, the greater the growth rate of 
the gambler’s wealth and the better the data compression. 


6.6 GAMBLING ESTIMATE OF THE ENTROPY OF ENGLISH 

We now estimate the entropy rate for English using a human gambler to
estimate probabilities. We assume that English consists of 27 characters
(26 letters and a space symbol). We therefore ignore punctuation and case
of letters. Two different approaches have been proposed to estimate the
entropy of English.

1. Shannon guessing game. In this approach the human subject is 
given a sample of English text and asked to guess the next letter. 
An optimal subject will estimate the probabilities of the next letter
and guess the most probable letter first, then the second most probable
letter next, and so on. The experimenter records the number of
guesses required to guess the next letter. The subject proceeds this 
way through a fairly large sample of text. We can then calculate the 
empirical frequency distribution of the number of guesses required 
to guess the next letter. Many of the letters will require only one 
guess; but a large number of guesses will usually be needed at the 
beginning of words or sentences. 

Now let us assume that the subject can be modeled as a computer 
making a deterministic choice of guesses given the past text. Then 
if we have the same machine and the sequence of guess numbers, 
we can reconstruct the English text. Just let the machine run, and if
the number of guesses at any position is k, choose the kth guess of
the machine as the next letter. Hence the amount of information in
the sequence of guess numbers is the same as in the English text. 
The entropy of the guess sequence is the entropy of English text. We 
can bound the entropy of the guess sequence by assuming that the 
samples are independent. Hence, the entropy of the guess sequence 
is bounded above by the entropy of the histogram in the experiment. 
The experiment was conducted in 1950 by Shannon [482], who 
obtained a value of 1.3 bits per symbol for the entropy of English. 

2. Gambling estimate. In this approach we let a human subject gamble
on the next letter in a sample of English text. This allows finer
gradations of judgment than does guessing. As in the case of a horse
race, the optimal bet is proportional to the conditional probability
of the next letter. The payoff is 27-for-1 on the correct letter.

Since sequential betting is equivalent to betting on the entire
sequence, we can write the payoff after $n$ letters as

\[ S_n = (27)^n b(X_1, X_2, \ldots, X_n). \tag{6.44} \]

Thus, after $n$ rounds of betting, the expected log wealth satisfies

\[ E \frac{1}{n} \log S_n = \log 27 + \frac{1}{n} E \log b(X_1, X_2, \ldots, X_n) \tag{6.45} \]

\[ = \log 27 + \frac{1}{n} \sum_{x^n} p(x^n) \log b(x^n) \tag{6.46} \]

\[ = \log 27 - \frac{1}{n} \sum_{x^n} p(x^n) \log \frac{p(x^n)}{b(x^n)} + \frac{1}{n} \sum_{x^n} p(x^n) \log p(x^n) \tag{6.47} \]

\[ = \log 27 - \frac{1}{n} D(p(x^n) \| b(x^n)) - \frac{1}{n} H(X_1, X_2, \ldots, X_n) \tag{6.48} \]

\[ \le \log 27 - \frac{1}{n} H(X_1, X_2, \ldots, X_n) \tag{6.49} \]

\[ \le \log 27 - H(\mathcal{X}), \tag{6.50} \]

where $H(\mathcal{X})$ is the entropy rate of English. Thus, $\log 27 - E \frac{1}{n} \log S_n$
is an upper bound on the entropy rate of English. The upper bound
estimate, $\hat{H}(\mathcal{X}) = \log 27 - \frac{1}{n} \log S_n$, converges to $H(\mathcal{X})$ with probability
1 if English is ergodic and the gambler uses $b(x^n) = p(x^n)$.
An experiment [131] with 12 subjects and a sample of 75 letters
from the book Jefferson the Virginian by Dumas Malone (Little,
Brown, Boston, 1948; the source used by Shannon) resulted in an
estimate of 1.34 bits per letter for the entropy of English.
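Both approaches can be mimicked with a mechanical subject. In the sketch below the human is replaced by a hypothetical bigram model with add-one smoothing, which is purely an assumption for illustration, as are all function names and the toy sample. The code turns text into guess numbers and back (the guessing game), computes the entropy of the guess histogram, and evaluates the gambling estimate log 27 - (1/n) log S_n.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters and the space symbol


def train(sample):
    """Count letter pairs in a training sample (stand-in for the subject's knowledge)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sample, sample[1:]):
        counts[prev][nxt] += 1
    return counts


def bets(counts, prev):
    """Proportion of wealth bet on each next letter, given the previous letter."""
    c = counts[prev]
    total = sum(c.values()) + len(ALPHABET)  # add-one smoothing keeps every bet positive
    return {a: (c[a] + 1) / total for a in ALPHABET}


def ranking(counts, prev):
    """Deterministic guessing order: most probable letter first, ties alphabetical."""
    b = bets(counts, prev)
    return sorted(ALPHABET, key=lambda a: (-b[a], a))


def guess_numbers(counts, text):
    """Number of guesses the machine needs for each letter of the text."""
    prev, out = " ", []
    for ch in text:
        out.append(ranking(counts, prev).index(ch) + 1)
        prev = ch
    return out


def reconstruct(counts, guesses):
    """Recover the text from the guess numbers using the same machine."""
    prev, out = " ", []
    for g in guesses:
        prev = ranking(counts, prev)[g - 1]
        out.append(prev)
    return "".join(out)


def histogram_entropy(guesses):
    """Entropy in bits of the empirical distribution of guess numbers."""
    n = len(guesses)
    return -sum(c / n * math.log2(c / n) for c in Counter(guesses).values())


def gambling_estimate(counts, text):
    """log2(27) - (1/n) log2(S_n) for 27-for-1 odds and the bets above."""
    prev, log_wealth = " ", 0.0
    for ch in text:
        log_wealth += math.log2(27 * bets(counts, prev)[ch])
        prev = ch
    return math.log2(27) - log_wealth / len(text)


sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG AND THE CAT SAT ON THE MAT"
model = train(sample)
guesses = guess_numbers(model, sample)
print(reconstruct(model, guesses) == sample)
print(f"guess-histogram entropy: {histogram_entropy(guesses):.2f} bits per letter")
print(f"gambling estimate:       {gambling_estimate(model, sample):.2f} bits per letter")
```

Because the toy model is trained on the very text it predicts, the printed figures are optimistic and say nothing about English itself; they only show how each estimate is formed.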


SUMMARY 

Doubling rate. $W(\mathbf{b}, \mathbf{p}) = E(\log S(X)) = \sum_{k=1}^{m} p_k \log b_k o_k$.

Optimal doubling rate. $W^*(\mathbf{p}) = \max_{\mathbf{b}} W(\mathbf{b}, \mathbf{p})$.

Proportional gambling is log-optimal.

\[ W^*(\mathbf{p}) = \max_{\mathbf{b}} W(\mathbf{b}, \mathbf{p}) = \sum p_i \log o_i - H(\mathbf{p}) \tag{6.51} \]

is achieved by $\mathbf{b}^* = \mathbf{p}$.

Growth rate. Wealth grows as $S_n \doteq 2^{nW^*(\mathbf{p})}$.

Conservation law. For uniform fair odds,

\[ H(\mathbf{p}) + W^*(\mathbf{p}) = \log m. \tag{6.52} \]

Side information. In a horse race $X$, the increase $\Delta W$ in doubling
rate due to side information $Y$ is

\[ \Delta W = I(X; Y). \tag{6.53} \]


PROBLEMS 


6.1 Horse race. Three horses run a race. A gambler offers 3-for-1
odds on each horse. These are fair odds under the assumption
that all horses are equally likely to win the race. The true win
probabilities are known to be

\[ \mathbf{p} = (p_1, p_2, p_3) = \left( \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4} \right). \tag{6.54} \]

Let $\mathbf{b} = (b_1, b_2, b_3)$, $b_i \ge 0$, $\sum b_i = 1$, be the amount invested on
each of the horses. The expected log wealth is thus

\[ W(\mathbf{b}) = \sum_{i=1}^{3} p_i \log 3 b_i. \tag{6.55} \]

(a) Maximize this over $\mathbf{b}$ to find $\mathbf{b}^*$ and $W^*$. Thus, the wealth
achieved in repeated horse races should grow to infinity like
$2^{nW^*}$ with probability 1.

(b) Show that if instead we put all of our money on horse 1, the
most likely winner, we will eventually go broke with probability 1.

6.2 Horse race with subfair odds. If the odds are bad (due to a track
take), the gambler may wish to keep money in his pocket. Let $b(0)$
be the amount in his pocket and let $b(1), b(2), \ldots, b(m)$ be the
amount bet on horses $1, 2, \ldots, m$, with odds $o(1), o(2), \ldots, o(m)$,
and win probabilities $p(1), p(2), \ldots, p(m)$. Thus, the resulting
wealth is $S(x) = b(0) + b(x)o(x)$, with probability $p(x)$, $x =
1, 2, \ldots, m$.

(a) Find $b^*$ maximizing $E \log S$ if $\sum 1/o(i) < 1$.

(b) Discuss $b^*$ if $\sum 1/o(i) > 1$. (There isn’t an easy closed-form
solution in this case, but a “water-filling” solution results from
the application of the Kuhn-Tucker conditions.)

6.3 Cards. An ordinary deck of cards containing 26 red cards and
26 black cards is shuffled and dealt out one card at a time without
replacement. Let $X_i$ be the color of the $i$th card.

(a) Determine $H(X_1)$.

(b) Determine $H(X_2)$.

(c) Does $H(X_k | X_1, X_2, \ldots, X_{k-1})$ increase or decrease?

(d) Determine $H(X_1, X_2, \ldots, X_{52})$.

6.4 Gambling. Suppose that one gambles sequentially on the card
outcomes in Problem 6.6.3. Even odds of 2-for-1 are paid. Thus,
the wealth $S_n$ at time $n$ is $S_n = 2^n b(x_1, x_2, \ldots, x_n)$, where
$b(x_1, x_2, \ldots, x_n)$ is the proportion of wealth bet on $x_1, x_2, \ldots, x_n$.
Find $\max_{b(\cdot)} E \log S_{52}$.

6.5 Beating the public odds. Consider a three-horse race with win
probabilities

\[ (p_1, p_2, p_3) = \left( \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4} \right) \]

and fair odds with respect to the (false) distribution

\[ (r_1, r_2, r_3) = \left( \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2} \right). \]

Thus, the odds are

\[ (o_1, o_2, o_3) = (4, 4, 2). \]

(a) What is the entropy of the race?

(b) Find the set of bets $(b_1, b_2, b_3)$ such that the compounded
wealth in repeated plays will grow to infinity.

6.6 Horse race. A three-horse race has win probabilities $\mathbf{p} =
(p_1, p_2, p_3)$, and odds $\mathbf{o} = (1, 1, 1)$. The gambler places bets $\mathbf{b} =
(b_1, b_2, b_3)$, $b_i \ge 0$, $\sum b_i = 1$, where $b_i$ denotes the proportion of
wealth bet on horse $i$. These odds are very bad. The gambler gets
his money back on the winning horse and loses the other bets.
Thus, the wealth $S_n$ at time $n$ resulting from independent gambles
goes exponentially to zero.

(a) Find the exponent.

(b) Find the optimal gambling scheme $\mathbf{b}$ (i.e., the bet $\mathbf{b}^*$ that maximizes
the exponent).

(c) Assuming that $\mathbf{b}$ is chosen as in part (b), what distribution $\mathbf{p}$
causes $S_n$ to go to zero at the fastest rate?

6.7 Horse race. Consider a horse race with four horses. Assume that
each horse pays 4-for-1 if it wins. Let the probabilities of winning
of the horses be $\{\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\}$. If you started with $100 and
bet optimally to maximize your long-term growth rate, what are
your optimal bets on each horse? Approximately how much money
would you have after 20 races with this strategy?

6.8 Lotto. The following analysis is a crude approximation to the 
games of Lotto conducted by various states. Assume that the player 
of the game is required to pay $1 to play and is asked to choose 
one number from a range 1 to 8. At the end of every day, the state 
lottery commission picks a number uniformly over the same range. 
The jackpot (i.e., all the money collected that day) is split among 
all the people who chose the same number as the one chosen by the 
state. For example, if 100 people played today, 10 of them chose 
the number 2, and the drawing at the end of the day picked 2, the 
$100 collected is split among the 10 people (i.e., each person who
picked 2 will receive $10, and the others will receive nothing).
The general population does not choose numbers uniformly; numbers
such as 3 and 7 are supposedly lucky and are
more popular than 4 or 8. Assume that the fraction of people choosing
the various numbers $1, 2, \ldots, 8$ is $(f_1, f_2, \ldots, f_8)$, and assume
that $n$ people play every day. Also assume that $n$ is very large, so
that any single person’s choice does not change the proportion of
people betting on any number.

(a) What is the optimal strategy to divide your money among 
the various possible tickets so as to maximize your long-term 
growth rate? (Ignore the fact that you cannot buy fractional 
tickets.) 

(b) What is the optimal growth rate that you can achieve in this 
game? 

(c) If C/i, / 2 ,... ， / 8 ) = (!，m 壶， H ， 古 )， and you start 
with $1, how long will it be before you become a millionaire? 

6.9 Horse race. Suppose that one is interested in maximizing the
doubling rate for a horse race. Let $p_1, p_2, \ldots, p_m$ denote the win
probabilities of the $m$ horses. When do the odds $(o_1, o_2, \ldots, o_m)$
yield a higher doubling rate than the odds $(o'_1, o'_2, \ldots, o'_m)$?

6.10 Horse race with probability estimates.

(a) Three horses race. Their probabilities of winning are $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$.
The odds are 4-for-1, 3-for-1, and 3-for-1. Let $W^*$ be the optimal
doubling rate. Suppose you believe that the probabilities
are $(\frac{1}{4}, \frac{1}{2}, \frac{1}{4})$. If you try to maximize the doubling rate, what
doubling rate $W$ will you achieve? By how much has your doubling
rate decreased due to your poor estimate of the probabilities
(i.e., what is $\Delta W = W^* - W$)?

(b) Now let the horse race be among $m$ horses, with probabilities
$\mathbf{p} = (p_1, p_2, \ldots, p_m)$ and odds $\mathbf{o} = (o_1, o_2, \ldots, o_m)$. If
you believe the true probabilities to be $\mathbf{q} = (q_1, q_2, \ldots, q_m)$,
and try to maximize the doubling rate $W$, what is $W^* - W$?

6.11 Two-envelope problem. One envelope contains $b$ dollars, the other
$2b$ dollars. The amount $b$ is unknown. An envelope is selected at
random. Let $X$ be the amount observed in this envelope, and let $Y$
be the amount in the other envelope. Adopt the strategy of switching
to the other envelope with probability $p(x)$, where $p(x) =
\frac{e^{-x}}{e^{-x} + e^{x}}$. Let $Z$ be the amount that the player receives. Thus,

\[ (X, Y) = \begin{cases} (b, 2b) & \text{with probability } \frac{1}{2} \\ (2b, b) & \text{with probability } \frac{1}{2} \end{cases} \tag{6.56} \]

\[ Z = \begin{cases} X & \text{with probability } 1 - p(x) \\ Y & \text{with probability } p(x). \end{cases} \tag{6.57} \]

(a) Show that $E(X) = E(Y) = \frac{3b}{2}$.

(b) Show that $E(Y/X) = \frac{5}{4}$. Since the expected ratio of the
amount in the other envelope is $\frac{5}{4}$, it seems that one should
always switch. (This is the origin of the switching paradox.)
However, observe that $E(Y) \ne E(X)E(Y/X)$. Thus, although
$E(Y/X) > 1$, it does not follow that $E(Y) > E(X)$.

(c) Let $J$ be the index of the envelope containing the maximum
amount of money, and let $J'$ be the index of the envelope
chosen by the algorithm. Show that for any $b$, $I(J; J') > 0$.
Thus, the amount in the first envelope always contains some
information about which envelope to choose.

(d) Show that $E(Z) > E(X)$. Thus, you can do better than always
staying or always switching. In fact, this is true for any monotonic
decreasing switching function $p(x)$. By randomly switching
according to $p(x)$, you are more likely to trade up than to
trade down.





6.12 Gambling. Find the horse win probabilities $p_1, p_2, \ldots, p_m$:

(a) Maximizing the doubling rate $W^*$ for given fixed known odds
$o_1, o_2, \ldots, o_m$.

(b) Minimizing the doubling rate for given fixed odds $o_1, o_2, \ldots, o_m$.

6.13 Dutch book. Consider a horse race with m = 2 horses, 

X = 1, 2 
P = b 1 

odds (for one) = 10， 30 

bets = b, 1 - b.

The odds are superfair. 

(a) There is a bet b that guarantees the same payoff regardless of 
which horse wins. Such a bet is called a Dutch book. Find this 
b and the associated wealth factor S(X). 

(b) What is the maximum growth rate of the wealth for the optimal
choice of b? Compare it to the growth rate for the Dutch book.

6.14 Horse race. Suppose that one is interested in maximizing the
doubling rate for a horse race. Let $p_1, p_2, \ldots, p_m$ denote the win
probabilities of the $m$ horses. When do the odds $(o_1, o_2, \ldots, o_m)$
yield a higher doubling rate than the odds $(o'_1, o'_2, \ldots, o'_m)$?

6.15 Entropy of a fair horse race. Let $X \sim p(x)$, $x = 1, 2, \ldots, m$,
denote the winner of a horse race. Suppose that the odds $o(x)$
are fair with respect to $p(x)$ [i.e., $o(x) = \frac{1}{p(x)}$]. Let $b(x)$ be the
amount bet on horse $x$, $b(x) \ge 0$, $\sum_{x=1}^{m} b(x) = 1$. Then the resulting
wealth factor is $S(x) = b(x)o(x)$, with probability $p(x)$.

(a) Find the expected wealth $ES(X)$.

(b) Find $W^*$, the optimal growth rate of wealth.

(c) Suppose that

\[ Y = \begin{cases} 1, & X = 1 \text{ or } 2 \\ 0, & \text{otherwise.} \end{cases} \]

If this side information is available before the bet, how much
does it increase the growth rate $W^*$?

(d) Find $I(X; Y)$.

6.16 Negative horse race. Consider a horse race with $m$ horses with
win probabilities $p_1, p_2, \ldots, p_m$. Here the gambler hopes that a
given horse will lose. He places bets $(b_1, b_2, \ldots, b_m)$, $\sum_{i=1}^{m} b_i = 1$,
on the horses, loses his bet $b_i$ if horse $i$ wins, and retains the rest of
his bets. (No odds.) Thus, $S = \sum_{j \ne i} b_j$, with probability $p_i$, and
one wishes to maximize $\sum p_i \ln(1 - b_i)$ subject to the constraint
$\sum b_i = 1$.

(a) Find the growth rate optimal investment strategy b*. Do not 
constrain the bets to be positive, but do constrain the bets to 
sum to 1. (This effectively allows short selling and margin.) 

(b) What is the optimal growth rate? 

6.17 St. Petersburg paradox. Many years ago in ancient St. Petersburg
the following gambling proposition caused great consternation. For
an entry fee of $c$ units, a gambler receives a payoff of $2^k$ units with
probability $2^{-k}$, $k = 1, 2, \ldots$.

(a) Show that the expected payoff for this game is infinite. For this
reason, it was argued that $c = \infty$ was a “fair” price to pay to
play this game. Most people find this answer absurd.

(b) Suppose that the gambler can buy a share of the game. For
example, if he invests $c/2$ units in the game, he receives $\frac{1}{2}$ a
share and a return $X/2$, where $\Pr(X = 2^k) = 2^{-k}$, $k = 1, 2, \ldots$.
Suppose that $X_1, X_2, \ldots$ are i.i.d. according to this distribution
and that the gambler reinvests all his wealth each time. Thus,
his wealth $S_n$ at time $n$ is given by

\[ S_n = \prod_{i=1}^{n} \frac{X_i}{c}. \tag{6.58} \]

Show that this limit is $\infty$ or 0, with probability 1, accordingly
as $c < c^*$ or $c > c^*$. Identify the “fair” entry fee $c^*$.

More realistically, the gambler should be allowed to keep a proportion
$\bar{b} = 1 - b$ of his money in his pocket and invest the rest
in the St. Petersburg game. His wealth at time $n$ is then

\[ S_n = \prod_{i=1}^{n} \left( \bar{b} + \frac{b X_i}{c} \right). \tag{6.59} \]

Let

\[ W(b, c) = \sum_{k=1}^{\infty} 2^{-k} \log\left( 1 - b + \frac{b\, 2^k}{c} \right). \tag{6.60} \]

We have

\[ S_n \doteq 2^{nW(b,c)}. \tag{6.61} \]

Let

\[ W^*(c) = \max_{0 \le b \le 1} W(b, c). \tag{6.62} \]

Here are some questions about $W^*(c)$.

(a) For what value of the entry fee $c$ does the optimizing value $b^*$
drop below 1?

(b) How does $b^*$ vary with $c$?

(c) How does $W^*(c)$ fall off with $c$?

Note that since $W^*(c) > 0$, for all $c$, we can conclude that any
entry fee $c$ is fair.

6.18 Super St. Petersburg. Finally, we have the super St. Petersburg
paradox, where $\Pr(X = 2^{2^k}) = 2^{-k}$, $k = 1, 2, \ldots$. Here the
expected log wealth is infinite for all $b > 0$, for all $c$, and the
gambler’s wealth grows to infinity faster than exponentially for
any $b > 0$. But that doesn’t mean that all investment ratios $b$ are
equally good. To see this, we wish to maximize the relative growth
rate with respect to some other portfolio, say, $\mathbf{b} = (\frac{1}{2}, \frac{1}{2})$. Show
that there exists a unique $b$ maximizing

\[ E \ln \frac{\bar{b} + bX/c}{\frac{1}{2} + \frac{1}{2}X/c} \]

and interpret the answer.

HISTORICAL NOTES 

The original treatment of gambling on a horse race is due to Kelly [308], 
who found that $\Delta W = I$. Log-optimal portfolios go back to the work
of Bernoulli, Kelly [308], Latane [346], and Latane and Tuttle [347]. 
Proportional gambling is sometimes referred to as the Kelly gambling 
scheme. The improvement in the probability of winning by switching 
envelopes in Problem 6.11 is based on Cover [130]. 

Shannon studied stochastic models for English in his original paper 
[472]. His guessing game for estimating the entropy rate of English is 
described in [482]. Cover and King [131] provide a gambling estimate 
for the entropy of English. The analysis of the St. Petersburg paradox 
is from Bell and Cover [39]. An alternative analysis can be found in 
Feller [208].

CHANNEL CAPACITY


What do we mean when we say that A communicates with B? We mean
that the physical acts of A have induced a desired physical state in B. This 
transfer of information is a physical process and therefore is subject to the 
uncontrollable ambient noise and imperfections of the physical signaling 
process itself. The communication is successful if the receiver B and the 
transmitter A agree on what was sent. 

In this chapter we find the maximum number of distinguishable signals 
for $n$ uses of a communication channel. This number grows exponentially
with $n$, and the exponent is known as the channel capacity. The
characterization of the channel capacity (the logarithm of the number of 
distinguishable signals) as the maximum mutual information is the central 
and most famous success of information theory. 

The mathematical analog of a physical signaling system is shown 
in Figure 7.1. Source symbols from some finite alphabet are mapped 
into some sequence of channel symbols, which then produces the output
sequence of the channel. The output sequence is random but has a
distribution that depends on the input sequence. From the output sequence, 
we attempt to recover the transmitted message. 

Each of the possible input sequences induces a probability distribution 
on the output sequences. Since two different input sequences may give rise 
to the same output sequence, the inputs are confusable. In the next few 
sections, we show that we can choose a “nonconfusable” subset of input 
sequences so that with high probability there is only one highly likely input 
that could have caused the particular output. We can then reconstruct the 
input sequences at the output with a negligible probability of error. By 
mapping the source into the appropriate “widely spaced” input sequences 
to the channel, we can transmit a message with very low probability of 
error and reconstruct the source message at the output. The maximum rate 
at which this can be done is called the capacity of the channel. 

Definition We define a discrete channel to be a system consisting of an
input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ and a probability transition matrix
$p(y|x)$ that expresses the probability of observing the output symbol $y$
given that we send the symbol $x$. The channel is said to be memoryless
if the probability distribution of the output depends only on the input at
that time and is conditionally independent of previous channel inputs or
outputs.

FIGURE 7.1. Communication system. [Block diagram: the message $W$ is mapped to the channel input $X^n$; the channel $p(y|x)$ produces the output $Y^n$; the decoder outputs $\hat{W}$, the estimate of the message.]

Definition We define the “information” channel capacity of a discrete
memoryless channel as

\[ C = \max_{p(x)} I(X; Y), \tag{7.1} \]

where the maximum is taken over all possible input distributions $p(x)$.
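
As a concrete illustration of the quantity being maximized in (7.1), the following minimal sketch evaluates $I(X; Y)$ for a given input distribution and transition matrix. The function name `mutual_information`, the list-of-rows layout `channel[x][y]`, and the test channel are illustrative assumptions, not notation from the text.

```python
import math

def mutual_information(p_x, channel):
    """Return I(X;Y) in bits; channel[x][y] is the transition probability p(y|x)."""
    num_outputs = len(channel[0])
    # Output distribution: p(y) = sum_x p(x) p(y|x).
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(num_outputs)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y, pyx in enumerate(channel[x]):
            if px > 0 and pyx > 0:  # zero-probability terms contribute nothing
                total += px * pyx * math.log2(pyx / p_y[y])
    return total

# Hypothetical test channel: a binary input reproduced exactly at the output.
identity_channel = [[1.0, 0.0], [0.0, 1.0]]
uniform_value = mutual_information([0.5, 0.5], identity_channel)
skewed_value = mutual_information([0.9, 0.1], identity_channel)
print(uniform_value, skewed_value)
```

For this channel the uniform input gives a larger value than the skewed one, which is the kind of comparison the maximization over $p(x)$ formalizes.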

We shall soon give an operational definition of channel capacity as the 
highest rate in bits per channel use at which information can be sent with 
arbitrarily low probability of error. Shannon’s second theorem establishes 
that the information channel capacity is equal to the operational channel 
capacity. Thus, we drop the word information in most discussions of 
channel capacity. 

There is a duality between the problems of data compression and data 
transmission. During compression, we remove all the redundancy in the 
data to form the most compressed version possible, whereas during data 
transmission, we add redundancy in a controlled fashion to combat errors 
in the channel. In Section 7.13 we show that a general communication 
system can be broken into two parts and that the problems of data compression
and data transmission can be considered separately.

7.1 EXAMPLES OF CHANNEL CAPACITY 

7.1.1 Noiseless Binary Channel 

Suppose that we have a channel whose binary input is reproduced
exactly at the output (Figure 7.2).

In this case, any transmitted bit is received without error. Hence, one
error-free bit can be transmitted per use of the channel, and the capacity is
1 bit. We can also calculate the information capacity $C = \max I(X; Y) = 1$ bit,
which is achieved by using $p(x) = (\frac{1}{2}, \frac{1}{2})$.

FIGURE 7.2. Noiseless binary channel. $C = 1$ bit. [Diagram: input 0 goes to output 0 and input 1 goes to output 1.]

7.1.2 Noisy Channel with Nonoverlapping Outputs 

This channel has two possible outputs corresponding to each of the two 
inputs (Figure 7.3). The channel appears to be noisy, but really is not. 
Even though the output of the channel is a random consequence of the 
input, the input can be determined from the output, and hence every transmitted
bit can be recovered without error. The capacity of this channel is
also 1 bit per transmission. We can also calculate the information capacity
$C = \max I(X; Y) = 1$ bit, which is achieved by using $p(x) = (\frac{1}{2}, \frac{1}{2})$.

FIGURE 7.3. Noisy channel with nonoverlapping outputs. $C = 1$ bit.

7.1.3 Noisy Typewriter

In this case the channel input is either received unchanged at the
output with probability $\frac{1}{2}$ or is transformed into the next letter with
probability $\frac{1}{2}$ (Figure 7.4). If the input has 26 symbols and we
use every alternate input symbol, we can transmit one of 13 symbols
without error with each transmission. Hence, the capacity of
this channel is $\log 13$ bits per transmission. We can also calculate
the information capacity

\[ C = \max I(X; Y) = \max \left( H(Y) - H(Y|X) \right) = \max H(Y) - 1 = \log 26 - 1 = \log 13, \]

achieved by using $p(x)$ distributed uniformly over all the inputs.
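
The two claims in this example can be checked numerically. The sketch below builds the 26-letter transition matrix, evaluates $I(X; Y)$ under a uniform input, and confirms that the alternate inputs have disjoint output sets. The helper names are hypothetical, and the wrap-around from the last letter to the first is an assumption about the figure rather than something stated in the text.

```python
import math

def noisy_typewriter(num_letters=26):
    """Each letter is received unchanged or as the next letter, each w.p. 1/2."""
    channel = []
    for x in range(num_letters):
        row = [0.0] * num_letters
        row[x] = 0.5
        row[(x + 1) % num_letters] = 0.5  # assumption: last letter wraps to first
        channel.append(row)
    return channel

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input pmf p_x and transition matrix channel[x][y]."""
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(len(channel[0]))]
    return sum(px * pyx * math.log2(pyx / p_y[y])
               for x, px in enumerate(p_x)
               for y, pyx in enumerate(channel[x])
               if px > 0 and pyx > 0)

channel = noisy_typewriter()
info = mutual_information([1.0 / 26] * 26, channel)

# Every alternate input: collect the outputs each one can produce.
output_sets = [{y for y, p in enumerate(channel[x]) if p > 0}
               for x in range(0, 26, 2)]
disjoint = sum(len(s) for s in output_sets) == len(set().union(*output_sets))
print(info, math.log2(13), disjoint)
```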

FIGURE 7.4. Noisy Typewriter. $C = \log 13$ bits.

7.1.4 Binary Symmetric Channel

Consider the binary symmetric channel (BSC), which is shown in Fig. 7.5. 
This is a binary channel in which the input symbols are complemented 
with probability p. This is the simplest model of a channel with errors, 
yet it captures most of the complexity of the general problem. 

When an error occurs, a 0 is received as a 1, and vice versa. The bits 
received do not reveal where the errors have occurred. In a sense, all 
the bits received are unreliable. Later we show that we can still use such 
a communication channel to send information at a nonzero rate with an 
arbitrarily small probability of error. 

We bound the mutual information by

\begin{align}
I(X; Y) &= H(Y) - H(Y|X) \tag{7.2} \\
&= H(Y) - \sum_x p(x) H(Y|X = x) \tag{7.3} \\
&= H(Y) - \sum_x p(x) H(p) \tag{7.4} \\
&= H(Y) - H(p) \tag{7.5} \\
&\le 1 - H(p), \tag{7.6}
\end{align}

where the last inequality follows because $Y$ is a binary random variable.
Equality is achieved when the input distribution is uniform. Hence, the
information capacity of a binary symmetric channel with parameter $p$ is

\[ C = 1 - H(p) \text{ bits.} \tag{7.7} \]
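
The bound (7.6) and the capacity formula (7.7) are easy to evaluate. The sketch below computes $H(Y) - H(p)$ on a grid of input distributions and compares the best value with $1 - H(p)$. The function names and the grid search are illustrative choices, and the crossover probability 0.11 is an arbitrary example value.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel, equation (7.7)."""
    return 1.0 - binary_entropy(p)

def bsc_mutual_information(pi, p):
    """I(X;Y) = H(Y) - H(p) when Pr(X = 1) = pi."""
    prob_y_is_one = pi * (1 - p) + (1 - pi) * p
    return binary_entropy(prob_y_is_one) - binary_entropy(p)

p = 0.11
grid = [k / 100 for k in range(101)]
best_pi = max(grid, key=lambda pi: bsc_mutual_information(pi, p))
print(best_pi, bsc_mutual_information(best_pi, p), bsc_capacity(p))
```

On this grid the maximizing input probability is 0.5, in agreement with the statement that equality in (7.6) holds for the uniform input.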


FIGURE 7.5. Binary symmetric channel. $C = 1 - H(p)$ bits.

7.1.5 Binary Erasure Channel

The analog of the binary symmetric channel in which some bits are lost
(rather than corrupted) is the binary erasure channel. In this channel, a
fraction $\alpha$ of the bits are erased. The receiver knows which bits have
been erased. The binary erasure channel has two inputs and three outputs
(Figure 7.6).

We calculate the capacity of the binary erasure channel as follows:

\begin{align}
C &= \max_{p(x)} I(X; Y) \tag{7.8} \\
&= \max_{p(x)} \left( H(Y) - H(Y|X) \right) \tag{7.9} \\
&= \max_{p(x)} H(Y) - H(\alpha). \tag{7.10}
\end{align}

The first guess for the maximum of $H(Y)$ would be $\log 3$, but we cannot
achieve this by any choice of input distribution $p(x)$. Letting $E$ be the
event $\{Y = e\}$, using the expansion

\[ H(Y) = H(Y, E) = H(E) + H(Y|E), \tag{7.11} \]

and letting $\Pr(X = 1) = \pi$, we have

\[ H(Y) = H\left( (1 - \pi)(1 - \alpha), \alpha, \pi(1 - \alpha) \right) = H(\alpha) + (1 - \alpha)H(\pi). \tag{7.12} \]

Hence

\begin{align}
C &= \max_{p(x)} H(Y) - H(\alpha) \tag{7.13} \\
&= \max_{\pi} \left( (1 - \alpha)H(\pi) + H(\alpha) \right) - H(\alpha) \tag{7.14} \\
&= \max_{\pi} (1 - \alpha)H(\pi) \tag{7.15} \\
&= 1 - \alpha, \tag{7.16}
\end{align}

where capacity is achieved by $\pi = \frac{1}{2}$.

FIGURE 7.6. Binary erasure channel.

The expression for the capacity has some intuitive meaning: Since a
proportion $\alpha$ of the bits are lost in the channel, we can recover (at most)
a proportion $1 - \alpha$ of the bits. Hence the capacity is at most $1 - \alpha$. It is
not immediately obvious that it is possible to achieve this rate. This will
follow from Shannon’s second theorem.
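
The erasure-channel calculation can be reproduced numerically. This minimal sketch computes $H(Y) - H(Y|X)$ directly from the three-point output distribution and searches a grid of input probabilities. The names `entropy` and `bec_mutual_information` and the example value of the erasure probability (0.25) are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bec_mutual_information(pi, alpha):
    """I(X;Y) for the binary erasure channel when Pr(X = 1) = pi."""
    # Output distribution over the symbols (0, e, 1).
    p_y = [(1 - pi) * (1 - alpha), alpha, pi * (1 - alpha)]
    # H(Y|X) equals H(alpha) for every input symbol.
    return entropy(p_y) - entropy([alpha, 1 - alpha])

alpha = 0.25
grid = [k / 100 for k in range(101)]
best_pi = max(grid, key=lambda pi: bec_mutual_information(pi, alpha))
best_value = bec_mutual_information(best_pi, alpha)
print(best_pi, best_value, 1 - alpha)
```

The grid maximum sits at $\pi = 0.5$ and matches $1 - \alpha$, consistent with the derivation of the erasure-channel capacity.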

In many practical channels, the sender receives some feedback from
the receiver. If feedback is available for the binary erasure channel, it is
very clear what to do: If a bit is lost, retransmit it until it gets through.
Since the bits get through with probability $1 - \alpha$, the effective rate of
transmission is $1 - \alpha$. In this way we are easily able to achieve a capacity
of $1 - \alpha$ with feedback.

Later in the chapter we prove that the rate $1 - \alpha$ is the best that can be
achieved both with and without feedback. This is one of the consequences
of the surprising fact that feedback does not increase the capacity of
discrete memoryless channels.

7.2 SYMMETRIC CHANNELS 


The capacity of the binary symmetric channel is $C = 1 - H(p)$ bits per
transmission, and the capacity of the binary erasure channel is $C = 1 - \alpha$
bits per transmission. Now consider the channel with transition matrix:

\[ p(y|x) = \begin{bmatrix} 0.3 & 0.2 & 0.5 \\ 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}. \tag{7.17} \]

Here the entry in the $x$th row and the $y$th column denotes the conditional
probability $p(y|x)$ that $y$ is received when $x$ is sent. In this channel, all
the rows of the probability transition matrix are permutations of each other
and so are the columns. Such a channel is said to be symmetric. Another
example of a symmetric channel is one of the form

\[ Y = X + Z \pmod{c}, \tag{7.18} \]

where $Z$ has some distribution on the integers $\{0, 1, 2, \ldots, c - 1\}$, $X$ has
the same alphabet as $Z$, and $Z$ is independent of $X$.

In both these cases, we can easily find an explicit expression for the
capacity of the channel. Letting $\mathbf{r}$ be a row of the transition matrix, we
have

\begin{align}
I(X; Y) &= H(Y) - H(Y|X) \tag{7.19} \\
&= H(Y) - H(\mathbf{r}) \tag{7.20} \\
&\le \log |\mathcal{Y}| - H(\mathbf{r}) \tag{7.21}
\end{align}

with equality if the output distribution is uniform. But $p(x) = 1/|\mathcal{X}|$
achieves a uniform distribution on $Y$, as seen from

\[ p(y) = \sum_{x \in \mathcal{X}} p(y|x)p(x) = \frac{1}{|\mathcal{X}|} \sum_{x} p(y|x) = c\,\frac{1}{|\mathcal{X}|} = \frac{1}{|\mathcal{Y}|}, \tag{7.22} \]

where $c$ is the sum of the entries in one column of the probability transition
matrix.

Thus, the channel in (7.17) has the capacity

\[ C = \max_{p(x)} I(X; Y) = \log 3 - H(0.5, 0.3, 0.2), \tag{7.23} \]

and $C$ is achieved by a uniform distribution on the input.

The transition matrix of the symmetric channel defined above is doubly 
stochastic. In the computation of the capacity, we used the facts that the 
rows were permutations of one another and that all the column sums were 
equal. 

Considering these properties, we can define a generalization of the 
concept of a symmetric channel as follows: 


Definition A channel is said to be symmetric if the rows of the channel
transition matrix $p(y|x)$ are permutations of each other and the columns
are permutations of each other. A channel is said to be weakly symmetric
if every row of the transition matrix $p(\cdot|x)$ is a permutation of every other
row and all the column sums $\sum_x p(y|x)$ are equal.

For example, the channel with transition matrix

\[ p(y|x) = \begin{pmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{2} & \frac{1}{6} \end{pmatrix} \tag{7.24} \]

is weakly symmetric but not symmetric.

The above derivation for symmetric channels carries over to weakly
symmetric channels as well. We have the following theorem for weakly
symmetric channels:

Theorem 7.2.1 For a weakly symmetric channel,

\[ C = \log |\mathcal{Y}| - H(\text{row of transition matrix}), \tag{7.25} \]

and this is achieved by a uniform distribution on the input alphabet.
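
The two definitions and Theorem 7.2.1 translate directly into a few lines of code. The sketch below checks the permutation and column-sum conditions with exact fractions and evaluates (7.25) for the matrices in (7.17) and (7.24). The function names are hypothetical, and the capacity function assumes its argument has already passed the weak-symmetry check.

```python
import math
from fractions import Fraction as F

def is_weakly_symmetric(matrix):
    """Rows are permutations of each other and all column sums are equal."""
    rows_ok = all(sorted(row) == sorted(matrix[0]) for row in matrix)
    col_sums = [sum(col) for col in zip(*matrix)]
    return rows_ok and all(s == col_sums[0] for s in col_sums)

def is_symmetric(matrix):
    """Rows are permutations of each other and so are the columns."""
    cols = [list(col) for col in zip(*matrix)]
    rows_ok = all(sorted(row) == sorted(matrix[0]) for row in matrix)
    cols_ok = all(sorted(col) == sorted(cols[0]) for col in cols)
    return rows_ok and cols_ok

def weakly_symmetric_capacity(matrix):
    """log|Y| - H(row), in bits; assumes the channel is weakly symmetric."""
    row = [float(p) for p in matrix[0]]
    row_entropy = -sum(p * math.log2(p) for p in row if p > 0)
    return math.log2(len(row)) - row_entropy

channel_717 = [[F(3, 10), F(2, 10), F(5, 10)],
               [F(5, 10), F(3, 10), F(2, 10)],
               [F(2, 10), F(5, 10), F(3, 10)]]
channel_724 = [[F(1, 3), F(1, 6), F(1, 2)],
               [F(1, 3), F(1, 2), F(1, 6)]]

for name, ch in (("(7.17)", channel_717), ("(7.24)", channel_724)):
    print(name, is_symmetric(ch), is_weakly_symmetric(ch),
          weakly_symmetric_capacity(ch))
```

Exact fractions are used so that the equality tests on rows and column sums are not affected by floating-point rounding.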

7.3 PROPERTIES OF CHANNEL CAPACITY 

1. $C \ge 0$ since $I(X; Y) \ge 0$.

2. $C \le \log |\mathcal{X}|$ since $C = \max I(X; Y) \le \max H(X) = \log |\mathcal{X}|$.

3. $C \le \log |\mathcal{Y}|$ for the same reason.

4. $I(X; Y)$ is a continuous function of $p(x)$.

5. $I(X; Y)$ is a concave function of $p(x)$ (Theorem 2.7.4). Since
$I(X; Y)$ is a concave function over a closed convex set, a local
maximum is a global maximum. From properties 2 and 3, the maximum
is finite, and we are justified in using the term maximum rather
than supremum in the definition of capacity. The maximum can then
be found by standard nonlinear optimization techniques such as gradient
search. Some of the methods that can be used include the
following:

• Constrained maximization using calculus and the Kuhn-Tucker 
conditions. 

• The Frank-Wolfe gradient search algorithm. 

• An iterative algorithm developed by Arimoto [25] and Blahut 
[65]. We describe the algorithm in Section 10.8. 
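
To make the last item concrete, here is a minimal sketch of the Arimoto-Blahut iteration in its commonly stated form; it is not a transcription of the presentation in Section 10.8. The function name, the stopping rule based on the gap between the standard lower and upper capacity bounds, and the test channel (a Z-channel whose noisy input is flipped with probability 1/2) are all assumptions chosen for illustration.

```python
import math

def blahut_arimoto(channel, tol=1e-9, max_iter=10000):
    """Estimate capacity (bits) and an optimizing input pmf for channel[x][y]."""
    num_inputs, num_outputs = len(channel), len(channel[0])
    r = [1.0 / num_inputs] * num_inputs  # start from the uniform input pmf
    lower = 0.0
    for _ in range(max_iter):
        p_y = [sum(r[x] * channel[x][y] for x in range(num_inputs))
               for y in range(num_outputs)]
        # c[x] = exp(D(p(.|x) || p_y)), with the divergence measured in nats.
        c = [math.exp(sum(pyx * math.log(pyx / p_y[y])
                          for y, pyx in enumerate(channel[x]) if pyx > 0))
             for x in range(num_inputs)]
        z = sum(r[x] * c[x] for x in range(num_inputs))
        # Capacity is sandwiched: log z <= C <= log max_x c[x].
        lower, upper = math.log2(z), math.log2(max(c))
        r = [r[x] * c[x] / z for x in range(num_inputs)]  # multiplicative update
        if upper - lower < tol:
            break
    return lower, r

# Hypothetical Z-channel: input 0 is noiseless, input 1 is flipped w.p. 1/2.
z_channel = [[1.0, 0.0], [0.5, 0.5]]
capacity, best_input = blahut_arimoto(z_channel)
print(capacity, best_input)
```

For this Z-channel the capacity can be found by hand: $I = H(a/2) - a$ with $a = \Pr(X = 1)$, which is maximized at $a = 0.4$ and gives $\log 1.25 \approx 0.32$ bits, so the output of the sketch can be compared against a known value.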

In general, there is no closed-form solution for the capacity. But for 
many simple channels it is possible to calculate the capacity using properties
such as symmetry. Some of the examples considered earlier are of
this form. 


7.4 PREVIEW OF THE CHANNEL CODING THEOREM 

So far, we have defined the information capacity of a discrete memoryless
channel. In the next section we prove Shannon’s second theorem, which
gives an operational meaning to the definition of capacity as the number
of bits we can transmit reliably over the channel. But first we will try to
give an intuitive idea as to why we can transmit $C$ bits of information over
a channel. The basic idea is that for large block lengths, every channel
looks like the noisy typewriter channel (Figure 7.4) and the channel has a
subset of inputs that produce essentially disjoint sequences at the output.

FIGURE 7.7. Channels after n uses.

For each (typical) input $n$-sequence, there are approximately $2^{nH(Y|X)}$
possible $Y$ sequences, all of them equally likely (Figure 7.7). We wish
to ensure that no two $X$ sequences produce the same $Y$ output sequence.
Otherwise, we will not be able to decide which $X$ sequence was sent.

The total number of possible (typical) $Y$ sequences is $\approx 2^{nH(Y)}$. This set
has to be divided into sets of size $2^{nH(Y|X)}$ corresponding to the different
input $X$ sequences. The total number of disjoint sets is less than or equal
to $2^{n(H(Y) - H(Y|X))} = 2^{nI(X;Y)}$. Hence, we can send at most $\approx 2^{nI(X;Y)}$
distinguishable sequences of length $n$.
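
As a rough numerical illustration of this counting argument (the parameters are chosen here for convenience and are not from the text), take a binary symmetric channel with crossover probability $p = 0.11$ and a uniform input, so that $H(Y) = 1$ bit and $H(Y|X) = H(0.11) \approx 0.5$ bit:

```latex
\[
\frac{2^{nH(Y)}}{2^{nH(Y|X)}} \approx \frac{2^{n}}{2^{0.5n}} = 2^{0.5n},
\qquad \text{so for } n = 100 \text{ there are about } 2^{50}
\text{ distinguishable input sequences.}
\]
```

This corresponds to roughly $0.5$ bit per channel use, which agrees with $1 - H(0.11)$ for the binary symmetric channel.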

Although the above derivation outlines an upper bound on the capacity, 
a stronger version of the above argument will be used in the next section 
to prove that this rate I is achievable with an arbitrarily low probability 
of error. 

Before we proceed to the proof of Shannon’s second theorem, we need 
a few definitions. 

7.5 DEFINITIONS 

We analyze a communication system as shown in Figure 7.8. 

A message $W$, drawn from the index set $\{1, 2, \ldots, M\}$, results in the
signal $X^n(W)$, which is received by the receiver as a random sequence
$Y^n \sim p(y^n|x^n)$. The receiver then guesses the index $W$ by an appropriate
decoding rule $\hat{W} = g(Y^n)$. The receiver makes an error if $\hat{W}$ is not the
same as the index $W$ that was transmitted. We now define these ideas
formally.

FIGURE 7.8. Communication channel.

Definition A discrete channel, denoted by $(\mathcal{X}, p(y|x), \mathcal{Y})$, consists of
two finite sets $\mathcal{X}$ and $\mathcal{Y}$ and a collection of probability mass functions
$p(y|x)$, one for each $x \in \mathcal{X}$, such that for every $x$ and $y$, $p(y|x) \ge 0$, and
for every $x$, $\sum_y p(y|x) = 1$, with the interpretation that $X$ is the input
and $Y$ is the output of the channel.

Definition The $n$th extension of the discrete memoryless channel (DMC)
is the channel $(\mathcal{X}^n, p(y^n|x^n), \mathcal{Y}^n)$, where

\[ p(y_k|x^k, y^{k-1}) = p(y_k|x_k), \quad k = 1, 2, \ldots, n. \tag{7.26} \]

Remark If the channel is used without feedback [i.e., if the input symbols
do not depend on the past output symbols, namely, $p(x_k|x^{k-1}, y^{k-1})
= p(x_k|x^{k-1})$], the channel transition function for the $n$th extension of the
discrete memoryless channel reduces to

\[ p(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x_i). \tag{7.27} \]

When we refer to the discrete memoryless channel, we mean the discrete
memoryless channel without feedback unless we state explicitly otherwise.

Definition An $(M, n)$ code for the channel $(\mathcal{X}, p(y|x), \mathcal{Y})$ consists of
the following:

1. An index set $\{1, 2, \ldots, M\}$.

2. An encoding function $X^n : \{1, 2, \ldots, M\} \to \mathcal{X}^n$, yielding codewords
$x^n(1), x^n(2), \ldots, x^n(M)$. The set of codewords is called the codebook.

3. A decoding function

\[ g : \mathcal{Y}^n \to \{1, 2, \ldots, M\}, \tag{7.28} \]

which is a deterministic rule that assigns a guess to each possible
received vector.

Definition (Conditional probability of error) Let

\[ \lambda_i = \Pr(g(Y^n) \ne i \mid X^n = x^n(i)) = \sum_{y^n} p(y^n|x^n(i))\, I(g(y^n) \ne i) \tag{7.29} \]

be the conditional probability of error given that index $i$ was sent, where
$I(\cdot)$ is the indicator function.

Definition The maximal probability of error $\lambda^{(n)}$ for an $(M, n)$ code is
defined as

\[ \lambda^{(n)} = \max_{i \in \{1, 2, \ldots, M\}} \lambda_i. \tag{7.30} \]

Definition The (arithmetic) average probability of error $P_e^{(n)}$ for an
$(M, n)$ code is defined as

\[ P_e^{(n)} = \frac{1}{M} \sum_{i=1}^{M} \lambda_i. \tag{7.31} \]

Note that if the index $W$ is chosen according to a uniform distribution
over the set $\{1, 2, \ldots, M\}$, and $X^n = x^n(W)$, then

\[ P_e^{(n)} = \Pr(W \ne g(Y^n)), \tag{7.32} \]

(i.e., $P_e^{(n)}$ is the probability of error). Also, obviously,

\[ P_e^{(n)} \le \lambda^{(n)}. \tag{7.33} \]

One would expect the maximal probability of error to behave quite differently
from the average probability. But in the next section we prove that
a small average probability of error implies a small maximal probability
of error at essentially the same rate.

It is worth noting that $P_e^{(n)}$ defined in (7.32) is only a mathematical
construct of the conditional probabilities of error $\lambda_i$ and is itself a probability
of error only if the message is chosen uniformly over the message
set $\{1, 2, \ldots, M\}$. However, both in the proof of achievability and the
converse, we choose a uniform distribution on $W$ to bound the probability
of error. This allows us to establish the behavior of $P_e^{(n)}$ and the maximal
probability of error $\lambda^{(n)}$ and thus characterize the behavior of the channel
regardless of how it is used (i.e., no matter what the distribution of $W$).

Definition The rate $R$ of an $(M, n)$ code is

\[ R = \frac{\log M}{n} \text{ bits per transmission.} \tag{7.34} \]
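
The definitions of the conditional, maximal, and average error probabilities and of the rate can be evaluated by brute force for a small code. The sketch below does this for a hypothetical example that is not taken from the text: the length-3 repetition code ($M = 2$, $n = 3$) with majority decoding over a binary symmetric channel with crossover probability 0.1. The function and variable names are illustrative.

```python
import itertools
import math

def code_error_probabilities(codebook, decoder, channel):
    """Return [lambda_1, ..., lambda_M] for a code over a memoryless channel."""
    n = len(codebook[0])
    outputs = range(len(channel[0]))
    lambdas = []
    for i, codeword in enumerate(codebook, start=1):
        error = 0.0
        for y in itertools.product(outputs, repeat=n):
            # No feedback: p(y^n | x^n) is the product of p(y_i | x_i).
            prob = math.prod(channel[x][yi] for x, yi in zip(codeword, y))
            if decoder(y) != i:
                error += prob
        lambdas.append(error)
    return lambdas

p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]
codebook = [(0, 0, 0), (1, 1, 1)]  # hypothetical (M, n) = (2, 3) code

def majority_decoder(y):
    """Guess index 2 if most received bits are 1, otherwise index 1."""
    return 2 if sum(y) >= 2 else 1

lambdas = code_error_probabilities(codebook, majority_decoder, bsc)
max_error = max(lambdas)                       # maximal probability of error
avg_error = sum(lambdas) / len(lambdas)        # average probability of error
rate = math.log2(len(codebook)) / len(codebook[0])
print(lambdas, max_error, avg_error, rate)
```

For this code each conditional error probability equals $3p^2(1 - p) + p^3$, the chance of two or more flips in three uses, so the average and maximal error probabilities coincide and the rate is $1/3$ bit per transmission.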

Definition A rate $R$ is said to be achievable if there exists a sequence
of $(\lceil 2^{nR} \rceil, n)$ codes such that the maximal probability of error $\lambda^{(n)}$ tends
to 0 as $n \to \infty$.

Later, we write $(2^{nR}, n)$ codes to mean $(\lceil 2^{nR} \rceil, n)$ codes. This will
simplify the notation.

Definition The capacity of a channel is the supremum of all achievable 
rates. 

Thus, rates less than capacity yield arbitrarily small probability of error 
for sufficiently large block lengths. 


7.6 JOINTLY TYPICAL SEQUENCES 


Roughly speaking, we decode a channel output $Y^n$ as the $i$th index if
the codeword $X^n(i)$ is “jointly typical” with the received signal $Y^n$. We
now define the important idea of joint typicality and find the probability
of joint typicality when $X^n(i)$ is the true cause of $Y^n$ and when it
is not.


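Definition The set $A_\epsilon^{(n)}$ of jointly typical sequences $\{(x^n, y^n)\}$ with
respect to the distribution $p(x, y)$ is the set of $n$-sequences with empirical
entropies $\epsilon$-close to the true entropies:

\begin{align}
A_\epsilon^{(n)} = \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \;
& \left| -\frac{1}{n} \log p(x^n) - H(X) \right| < \epsilon, \tag{7.35} \\
& \left| -\frac{1}{n} \log p(y^n) - H(Y) \right| < \epsilon, \tag{7.36} \\
& \left| -\frac{1}{n} \log p(x^n, y^n) - H(X, Y) \right| < \epsilon \Big\}, \tag{7.37}
\end{align}

where

\[ p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i). \tag{7.38} \]

Theorem 7.6.1 (Joint AEP) Let $(X^n, Y^n)$ be sequences of length $n$
drawn i.i.d. according to $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$. Then:

1. $\Pr((X^n, Y^n) \in A_\epsilon^{(n)}) \to 1$ as $n \to \infty$.

2. $|A_\epsilon^{(n)}| \le 2^{n(H(X,Y) + \epsilon)}$.

3. If $(\tilde{X}^n, \tilde{Y}^n) \sim p(x^n)p(y^n)$ [i.e., $\tilde{X}^n$ and $\tilde{Y}^n$ are independent with the
same marginals as $p(x^n, y^n)$], then

\[ \Pr\left( (\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)} \right) \le 2^{-n(I(X;Y) - 3\epsilon)}. \tag{7.39} \]

Also, for sufficiently large $n$,

\[ \Pr\left( (\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)} \right) \ge (1 - \epsilon)\, 2^{-n(I(X;Y) + 3\epsilon)}. \tag{7.40} \]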

Proof

1. We begin by showing that with high probability, the sequence is in
the typical set. By the weak law of large numbers,

\[ -\frac{1}{n} \log p(X^n) \to -E[\log p(X)] = H(X) \quad \text{in probability.} \tag{7.41} \]

Hence, given $\epsilon > 0$, there exists $n_1$, such that for all $n > n_1$,

\[ \Pr\left( \left| -\frac{1}{n} \log p(X^n) - H(X) \right| \ge \epsilon \right) < \frac{\epsilon}{3}. \tag{7.42} \]

Similarly, by the weak law,

\[ -\frac{1}{n} \log p(Y^n) \to -E[\log p(Y)] = H(Y) \quad \text{in probability} \tag{7.43} \]

and

\[ -\frac{1}{n} \log p(X^n, Y^n) \to -E[\log p(X, Y)] = H(X, Y) \quad \text{in probability,} \tag{7.44} \]

and there exist $n_2$ and $n_3$, such that for all $n > n_2$,

\[ \Pr\left( \left| -\frac{1}{n} \log p(Y^n) - H(Y) \right| \ge \epsilon \right) < \frac{\epsilon}{3} \tag{7.45} \]

and for all $n > n_3$,

\[ \Pr\left( \left| -\frac{1}{n} \log p(X^n, Y^n) - H(X, Y) \right| \ge \epsilon \right) < \frac{\epsilon}{3}. \tag{7.46} \]

Choosing $n > \max\{n_1, n_2, n_3\}$, the probability of the union of the
sets in (7.42), (7.45), and (7.46) must be less than $\epsilon$. Hence for $n$
sufficiently large, the probability of the set $A_\epsilon^{(n)}$ is greater than $1 - \epsilon$,
establishing the first part of the theorem.

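2. To prove the second part of the theorem, we have

\begin{align}
1 &= \sum p(x^n, y^n) \tag{7.47} \\
&\ge \sum_{A_\epsilon^{(n)}} p(x^n, y^n) \tag{7.48} \\
&\ge |A_\epsilon^{(n)}|\, 2^{-n(H(X,Y) + \epsilon)}, \tag{7.49}
\end{align}

and hence

\[ |A_\epsilon^{(n)}| \le 2^{n(H(X,Y) + \epsilon)}. \tag{7.50} \]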

3. Now if $\tilde{X}^n$ and $\tilde{Y}^n$ are independent but have the same marginals as
$X^n$ and $Y^n$, then

\begin{align}
\Pr((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) &= \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} p(x^n)p(y^n) \tag{7.51} \\
&\le 2^{n(H(X,Y) + \epsilon)}\, 2^{-n(H(X) - \epsilon)}\, 2^{-n(H(Y) - \epsilon)} \tag{7.52} \\
&= 2^{-n(I(X;Y) - 3\epsilon)}. \tag{7.53}
\end{align}

For sufficiently large $n$, $\Pr(A_\epsilon^{(n)}) \ge 1 - \epsilon$, and therefore

\begin{align}
1 - \epsilon &\le \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} p(x^n, y^n) \tag{7.54} \\
&\le |A_\epsilon^{(n)}|\, 2^{-n(H(X,Y) - \epsilon)} \tag{7.55}
\end{align}

and

\[ |A_\epsilon^{(n)}| \ge (1 - \epsilon)\, 2^{n(H(X,Y) - \epsilon)}. \tag{7.56} \]

By similar arguments to the upper bound above, we can also show
that for $n$ sufficiently large,

\begin{align}
\Pr((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) &= \sum_{A_\epsilon^{(n)}} p(x^n)p(y^n) \tag{7.57} \\
&\ge (1 - \epsilon)\, 2^{n(H(X,Y) - \epsilon)}\, 2^{-n(H(X) + \epsilon)}\, 2^{-n(H(Y) + \epsilon)} \tag{7.58} \\
&= (1 - \epsilon)\, 2^{-n(I(X;Y) + 3\epsilon)}. \quad \Box \tag{7.59}
\end{align}

The jointly typical set is illustrated in Figure 7.9. There are about
$2^{nH(X)}$ typical $X$ sequences and about $2^{nH(Y)}$ typical $Y$ sequences. However,
since there are only $2^{nH(X,Y)}$ jointly typical sequences, not all pairs
of typical $X^n$ and typical $Y^n$ are also jointly typical. The probability that
any randomly chosen pair is jointly typical is about $2^{-nI(X;Y)}$. Hence,
we can consider about $2^{nI(X;Y)}$ such pairs before we are likely to come
across a jointly typical pair. This suggests that there are about $2^{nI(X;Y)}$
distinguishable signals $X^n$.

FIGURE 7.9. Jointly typical sequences.

Another way to look at this is in terms of the set of jointly typical
sequences for a fixed output sequence $Y^n$, presumably the output sequence
resulting from the true input signal $X^n$. For this sequence $Y^n$, there are
about $2^{nH(X|Y)}$ conditionally typical input signals. The probability that
some randomly chosen (other) input signal $X^n$ is jointly typical with $Y^n$
is about $2^{nH(X|Y)}/2^{nH(X)} = 2^{-nI(X;Y)}$. This again suggests that we can
choose about $2^{nI(X;Y)}$ codewords $X^n(W)$ before one of these codewords
will get confused with the codeword that caused the output $Y^n$.
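
For very short block lengths the jointly typical set can be enumerated exhaustively, which gives a direct check on parts 2 and 3 of the joint AEP. The sketch below is an illustration under assumed parameters (a uniform binary input sent through a binary symmetric channel with crossover probability 0.1, $n = 8$, $\epsilon = 0.2$); the function name and the returned dictionary keys are hypothetical, and the code assumes every symbol has positive marginal probability.

```python
import itertools
import math

def jointly_typical_stats(pxy, n, eps):
    """Enumerate all (x^n, y^n) and summarize the jointly typical set."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    entropy = lambda pmf: -sum(p * math.log2(p) for p in pmf.values() if p > 0)
    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy)
    size, prob_joint, prob_indep = 0, 0.0, 0.0
    for xn in itertools.product(sorted(px), repeat=n):
        lx = sum(math.log2(px[x]) for x in xn)
        for yn in itertools.product(sorted(py), repeat=n):
            pairs = list(zip(xn, yn))
            if any(pxy.get(pair, 0.0) == 0.0 for pair in pairs):
                continue  # zero-probability pairs are never typical
            ly = sum(math.log2(py[y]) for y in yn)
            lxy = sum(math.log2(pxy[pair]) for pair in pairs)
            if (abs(-lx / n - hx) < eps and abs(-ly / n - hy) < eps
                    and abs(-lxy / n - hxy) < eps):
                size += 1
                prob_joint += 2.0 ** lxy        # probability under p(x^n, y^n)
                prob_indep += 2.0 ** (lx + ly)  # probability under p(x^n)p(y^n)
    return {"size": size, "prob_joint": prob_joint, "prob_indep": prob_indep,
            "hxy": hxy, "info": hx + hy - hxy}

# Assumed joint pmf: uniform input through a BSC with crossover 0.1.
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
n, eps = 8, 0.2
stats = jointly_typical_stats(pxy, n, eps)
print(stats["size"], 2 ** (n * (stats["hxy"] + eps)))
print(stats["prob_indep"], 2 ** (-n * (stats["info"] - 3 * eps)))
```

At such a small $n$ the set carries well under probability 1 and the bound in (7.39) is loose; the point is only that the inequalities that hold for every $n$ can be verified exactly, while part 1 is an asymptotic statement.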


7.7 CHANNEL CODING THEOREM 

We now prove what is perhaps the basic theorem of information theory, 
the achievability of channel capacity, first stated and essentially proved 
by Shannon in his original 1948 paper. The result is rather counterintuitive;
if the channel introduces errors, how can one correct them all? Any
correction process is also subject to error, ad infinitum. 

Shannon used a number of new ideas to prove that information can be 
sent reliably over a channel at all rates up to the channel capacity. These 
ideas include: 

• Allowing an arbitrarily small but nonzero probability of error 

• Using the channel many times in succession, so that the law of large 
numbers comes into effect 

• Calculating the average of the probability of error over a random 
choice of codebooks, which symmetrizes the probability, and which 
can then be used to show the existence of at least one good code 

Shannon’s outline of the proof was based on the idea of typical sequences,
but the proof was not made rigorous until much later. The proof given
below makes use of the properties of typical sequences and is probably 
the simplest of the proofs developed so far. As in all the proofs, we 
use the same essential ideas — random code selection, calculation of the 
average probability of error for a random choice of codewords, and so 
on. The main difference is in the decoding rule. In the proof, we decode 
by joint typicality; we look for a codeword that is jointly typical with the 
received sequence. If we find a unique codeword satisfying this property, 
we declare that word to be the transmitted codeword. By the properties
of joint typicality stated previously, with high probability the transmitted
codeword and the received sequence are jointly typical, since they are
probabilistically related. Also, the probability that any other codeword
looks jointly typical with the received sequence is $2^{-nI}$. Hence, if we
have fewer than $2^{nI}$ codewords, then with high probability there will be
no other codewords that can be confused with the transmitted codeword,
and the probability of error is small.

Although jointly typical decoding is suboptimal, it is simple to analyze 
and still achieves all rates below capacity. 

We now give the complete statement and proof of Shannon’s second 
theorem: 

Theorem 7.7.1 (Channel coding theorem) For a discrete memoryless
channel, all rates below capacity $C$ are achievable. Specifically, for
every rate $R < C$, there exists a sequence of $(2^{nR}, n)$ codes with maximum
probability of error $\lambda^{(n)} \to 0$.

Conversely, any sequence of $(2^{nR}, n)$ codes with $\lambda^{(n)} \to 0$ must have
$R \le C$.

Proof: We prove that rates $R < C$ are achievable and postpone proof of
the converse to Section 7.9.

Achievability: Fix $p(x)$. Generate a $(2^{nR}, n)$ code at random according
to the distribution $p(x)$. Specifically, we generate $2^{nR}$ codewords independently
according to the distribution

\[ p(x^n) = \prod_{i=1}^{n} p(x_i). \tag{7.60} \]

We exhibit the $2^{nR}$ codewords as the rows of a matrix:

\[ \mathcal{C} = \begin{bmatrix} x_1(1) & x_2(1) & \cdots & x_n(1) \\ \vdots & \vdots & \ddots & \vdots \\ x_1(2^{nR}) & x_2(2^{nR}) & \cdots & x_n(2^{nR}) \end{bmatrix}. \tag{7.61} \]

Each entry in this matrix is generated i.i.d. according to $p(x)$. Thus, the
probability that we generate a particular code $\mathcal{C}$ is

\[ \Pr(\mathcal{C}) = \prod_{w=1}^{2^{nR}} \prod_{i=1}^{n} p(x_i(w)). \tag{7.62} \]

Consider the following sequence of events:

1. A random code $\mathcal{C}$ is generated as described in (7.62) according to
$p(x)$.

2. The code $\mathcal{C}$ is then revealed to both sender and receiver. Both sender
and receiver are also assumed to know the channel transition matrix
$p(y|x)$ for the channel.

3. A message $W$ is chosen according to a uniform distribution

\[ \Pr(W = w) = 2^{-nR}, \quad w = 1, 2, \ldots, 2^{nR}. \tag{7.63} \]

4. The $w$th codeword $X^n(w)$, corresponding to the $w$th row of $\mathcal{C}$, is
sent over the channel.

5. The receiver receives a sequence $Y^n$ according to the distribution

\[ P(y^n|x^n(w)) = \prod_{i=1}^{n} p(y_i|x_i(w)). \tag{7.64} \]

6. The receiver guesses which message was sent. (The optimum procedure
to minimize probability of error is maximum likelihood decoding
(i.e., the receiver should choose the a posteriori most likely
message). But this procedure is difficult to analyze. Instead, we will
use jointly typical decoding, which is described below. Jointly typical
decoding is easier to analyze and is asymptotically optimal.) In
jointly typical decoding, the receiver declares that the index $\hat{W}$ was
sent if the following conditions are satisfied:

• $(X^n(\hat{W}), Y^n)$ is jointly typical.

• There is no other index $W' \ne \hat{W}$ such that $(X^n(W'), Y^n) \in A_\epsilon^{(n)}$.

If no such $\hat{W}$ exists or if there is more than one such, an error is
declared. (We may assume that the receiver outputs a dummy index
such as 0 in this case.)

7. There is a decoding error if $\hat{W} \ne W$. Let $\mathcal{E}$ be the event $\{\hat{W} \ne W\}$.
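
The decoding rule in step 6 can be sketched in code as follows. This is a minimal illustration with hypothetical names: it builds a joint-typicality predicate from an assumed joint pmf (a uniform binary input through a binary symmetric channel with crossover probability 0.1), uses a small hand-picked codebook instead of a randomly generated one, and returns the dummy index 0 when no codeword, or more than one, is jointly typical with the received sequence.

```python
import math

def make_joint_typicality_test(pxy, eps):
    """Build a predicate for membership of (x^n, y^n) in the jointly typical set."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    entropy = lambda pmf: -sum(p * math.log2(p) for p in pmf.values() if p > 0)
    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy)

    def is_jointly_typical(xn, yn):
        n = len(xn)
        pairs = list(zip(xn, yn))
        if any(pxy.get(pair, 0.0) == 0.0 for pair in pairs):
            return False  # zero-probability pairs are never typical
        lx = sum(math.log2(px[x]) for x in xn)
        ly = sum(math.log2(py[y]) for y in yn)
        lxy = sum(math.log2(pxy[pair]) for pair in pairs)
        return (abs(-lx / n - hx) < eps and abs(-ly / n - hy) < eps
                and abs(-lxy / n - hxy) < eps)

    return is_jointly_typical

def jointly_typical_decode(codebook, yn, is_jointly_typical):
    """Return the unique 1-based index typical with yn, or the dummy index 0."""
    matches = [w for w, xn in enumerate(codebook, start=1)
               if is_jointly_typical(xn, yn)]
    return matches[0] if len(matches) == 1 else 0

pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
typical = make_joint_typicality_test(pxy, eps=0.2)
codebook = [(0,) * 8, (1,) * 8, (0, 0, 0, 0, 1, 1, 1, 1)]  # hand-picked, not random

one_flip = jointly_typical_decode(codebook, (0, 0, 0, 0, 0, 0, 0, 1), typical)
no_flip = jointly_typical_decode(codebook, (0,) * 8, typical)
near_third = jointly_typical_decode(codebook, (0, 0, 0, 0, 0, 1, 1, 1), typical)
print(one_flip, no_flip, near_third)
```

With these parameters a received sequence identical to a codeword is rejected, because at $n = 8$ an error-free reception has an empirical joint entropy more than $\epsilon$ away from $H(X, Y)$. This is one way to see why jointly typical decoding is suboptimal at finite block lengths even though it suffices for the asymptotic argument.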
Analysis of the probability of error 

Outline: We first outline the analysis. Instead of calculating the probability
of error for a single code, we calculate the average over all codes
generated at random according to the distribution (7.62). By the symmetry
of the code construction, the average probability of error does not depend
on the particular index that was sent. For a typical codeword, there are two
different sources of error when we use jointly typical decoding: Either the
output $Y^n$ is not jointly typical with the transmitted codeword or there is
some other codeword that is jointly typical with $Y^n$. The probability that
the transmitted codeword and the received sequence are jointly typical
goes to 1, as shown by the joint AEP. For any rival codeword, the probability
that it is jointly typical with the received sequence is approximately
$2^{-nI}$, and hence we can use about $2^{nI}$ codewords and still have a low
probability of error. We will later extend the argument to find a code with
a low maximal probability of error.

Detailed calculation of the probability of error: We let $W$ be drawn
according to a uniform distribution over $\{1, 2, \ldots, 2^{nR}\}$ and use jointly
typical decoding $\hat{W}(y^n)$ as described in step 6. Let $\mathcal{E} = \{\hat{W}(Y^n) \neq W\}$
denote the error event. We will calculate the average probability of error,
averaged over all codewords in the codebook, and averaged over all codebooks;
that is, we calculate

$$\Pr(\mathcal{E}) = \sum_{\mathcal{C}} \Pr(\mathcal{C}) P_e^{(n)}(\mathcal{C}) \tag{7.65}$$

$$= \sum_{\mathcal{C}} \Pr(\mathcal{C}) \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \lambda_w(\mathcal{C}) \tag{7.66}$$

$$= \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \sum_{\mathcal{C}} \Pr(\mathcal{C}) \lambda_w(\mathcal{C}), \tag{7.67}$$



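where $P_e^{(n)}(\mathcal{C})$ is defined for jointly typical decoding. By the symmetry
of the code construction, the average probability of error averaged over
all codes does not depend on the particular index that was sent [i.e.,
$\sum_{\mathcal{C}} \Pr(\mathcal{C}) \lambda_w(\mathcal{C})$ does not depend on $w$]. Thus, we can assume without
loss of generality that the message $W = 1$ was sent, since

$$\Pr(\mathcal{E}) = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \sum_{\mathcal{C}} \Pr(\mathcal{C}) \lambda_w(\mathcal{C}) \tag{7.68}$$

$$= \sum_{\mathcal{C}} \Pr(\mathcal{C}) \lambda_1(\mathcal{C}) \tag{7.69}$$

$$= \Pr(\mathcal{E} \mid W = 1). \tag{7.70}$$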

Define the following events:

$$E_i = \{ (X^n(i), Y^n) \text{ is in } A_\epsilon^{(n)} \}, \quad i \in \{1, 2, \ldots, 2^{nR}\}, \tag{7.71}$$

where $E_i$ is the event that the $i$th codeword and $Y^n$ are jointly typical.
Recall that $Y^n$ is the result of sending the first codeword $X^n(1)$ over the
channel.

Then an error occurs in the decoding scheme if either $E_1^c$ occurs (when
the transmitted codeword and the received sequence are not jointly typical)
or $E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}$ occurs (when a wrong codeword is jointly typical
with the received sequence). Hence, letting $P(\mathcal{E})$ denote $\Pr(\mathcal{E} \mid W = 1)$, we
have

$$\Pr(\mathcal{E} \mid W = 1) = P(E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}} \mid W = 1) \tag{7.72}$$

$$\le P(E_1^c \mid W = 1) + \sum_{i=2}^{2^{nR}} P(E_i \mid W = 1), \tag{7.73}$$

by the union of events bound for probabilities. Now, by the joint AEP,
$P(E_1^c \mid W = 1) \to 0$, and hence

$$P(E_1^c \mid W = 1) \le \epsilon \quad \text{for } n \text{ sufficiently large}. \tag{7.74}$$

Since by the code generation process, $X^n(1)$ and $X^n(i)$ are independent
for $i \neq 1$, so are $Y^n$ and $X^n(i)$. Hence, the probability that $X^n(i)$ and $Y^n$
are jointly typical is $\le 2^{-n(I(X;Y) - 3\epsilon)}$ by the joint AEP. Consequently,

$$\Pr(\mathcal{E}) = \Pr(\mathcal{E} \mid W = 1) \le P(E_1^c \mid W = 1) + \sum_{i=2}^{2^{nR}} P(E_i \mid W = 1) \tag{7.75}$$

$$\le \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y) - 3\epsilon)} \tag{7.76}$$

$$= \epsilon + (2^{nR} - 1) 2^{-n(I(X;Y) - 3\epsilon)} \tag{7.77}$$

$$\le \epsilon + 2^{3n\epsilon} 2^{-n(I(X;Y) - R)} \tag{7.78}$$

$$\le 2\epsilon \tag{7.79}$$

if $n$ is sufficiently large and $R < I(X;Y) - 3\epsilon$. Hence, if $R < I(X;Y)$,
we can choose $\epsilon$ and $n$ so that the average probability of error, averaged
over codebooks and codewords, is less than $2\epsilon$.
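To get a feel for how large $n$ must be, the right-hand side of (7.78) can be evaluated numerically. The sketch below is only an illustration; the function names and the sample values of $R$, $I(X;Y)$, and $\epsilon$ are assumptions chosen for the example.

```python
def random_coding_bound(n, R, I, eps):
    """Right-hand side of (7.78): eps + 2^(3 n eps) * 2^(-n (I - R))."""
    return eps + 2.0 ** (-n * (I - R - 3 * eps))

def smallest_block_length(R, I, eps):
    """Smallest n for which the bound (7.78) is at most 2 * eps.

    Requires R < I - 3 * eps, so that the exponent decays with n.
    """
    if I - R - 3 * eps <= 0:
        raise ValueError("need R < I - 3*eps")
    n = 1
    while random_coding_bound(n, R, I, eps) > 2 * eps:
        n += 1
    return n
```

For instance, with $I(X;Y) = 0.5$, $R = 0.25$, and $\epsilon = 0.05$, the exponent is $-0.1n$, and `smallest_block_length(0.25, 0.5, 0.05)` gives the first block length at which the second term drops below $\epsilon$.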

To finish the proof, we will strengthen this conclusion by a series of 
code selections. 

1. Choose $p(x)$ in the proof to be $p^*(x)$, the distribution on $X$ that
achieves capacity. Then the condition $R < I(X; Y)$ can be replaced
by the achievability condition $R < C$.



2. Get rid of the average over codebooks. Since the average probability
of error over codebooks is small ($\le 2\epsilon$), there exists at least
one codebook $\mathcal{C}^*$ with a small average probability of error. Thus,
$\Pr(\mathcal{E} \mid \mathcal{C}^*) \le 2\epsilon$. Determination of $\mathcal{C}^*$ can be achieved by an exhaustive
search over all $(2^{nR}, n)$ codes. Note that

$$\Pr(\mathcal{E} \mid \mathcal{C}^*) = \frac{1}{2^{nR}} \sum_{i=1}^{2^{nR}} \lambda_i(\mathcal{C}^*), \tag{7.80}$$

since we have chosen $W$ according to a uniform distribution as
specified in (7.63).

3. Throw away the worst half of the codewords in the best codebook
$\mathcal{C}^*$. Since the arithmetic average probability of error $P_e^{(n)}(\mathcal{C}^*)$ for
this code is less than $2\epsilon$, we have

$$\Pr(\mathcal{E} \mid \mathcal{C}^*) = \frac{1}{2^{nR}} \sum_{i=1}^{2^{nR}} \lambda_i(\mathcal{C}^*) \le 2\epsilon, \tag{7.81}$$

which implies that at least half the indices $i$ and their associated
codewords $X^n(i)$ must have conditional probability of error $\lambda_i$ less
than $4\epsilon$ (otherwise, these codewords themselves would contribute
more than $2\epsilon$ to the sum). Hence the best half of the codewords
have a maximal probability of error less than $4\epsilon$. If we reindex these
codewords, we have $2^{nR-1}$ codewords. Throwing out half the codewords
has changed the rate from $R$ to $R - \frac{1}{n}$, which is negligible
for large $n$.

Combining all these improvements, we have constructed a code of rate
$R' = R - \frac{1}{n}$, with maximal probability of error $\lambda^{(n)} \le 4\epsilon$. This proves the
achievability of any rate below capacity. □
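The expurgation in step 3 is easy to mimic on a list of conditional error probabilities. The following sketch assumes the values $\lambda_i$ are already known for a codebook with an even number of codewords; the names `expurgate` and `rate_after_expurgation` are hypothetical helpers, not notation from the proof.

```python
def expurgate(lambdas):
    """Keep the half of the codewords with the smallest conditional error
    probabilities and return their original indices in increasing order."""
    order = sorted(range(len(lambdas)), key=lambda i: lambdas[i])
    return sorted(order[: len(lambdas) // 2])

def rate_after_expurgation(R, n):
    """Half of 2^(nR) codewords is 2^(n (R - 1/n)) codewords."""
    return R - 1.0 / n

lambdas = [0.30, 0.01, 0.02, 0.07]        # illustrative values only
kept = expurgate(lambdas)
average = sum(lambdas) / len(lambdas)
worst_kept = max(lambdas[i] for i in kept)
```

Here the average error probability is 0.1, and the largest error probability among the kept codewords is at most twice that average, which is the same counting argument that turns the bound $2\epsilon$ on the average into the bound $4\epsilon$ on the maximum.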

Random coding is the method of proof for Theorem 7.7.1, not the
method of signaling. Codes are selected at random in the proof merely to
symmetrize the mathematics and to show the existence of a good deterministic
code. We proved that the average over all codes of block length
$n$ has a small probability of error. We can find the best code within this
set by an exhaustive search. Incidentally, this shows that the Kolmogorov
complexity (Chapter 14) of the best code is a small constant. This means
that the revelation (in step 2) to the sender and receiver of the best code
$\mathcal{C}^*$ requires no channel. The sender and receiver merely agree to use the
best $(2^{nR}, n)$ code for the channel.

Although the theorem shows that there exist good codes with arbitrarily
small probability of error for long block lengths, it does not provide
a way of constructing the best codes. If we use the scheme suggested
by the proof and generate a code at random with the appropriate distribution,
the code constructed is likely to be good for long block lengths.
However, without some structure in the code, it is very difficult to decode
(the simple scheme of table lookup requires an exponentially large table).
Hence the theorem does not provide a practical coding scheme. Ever
since Shannon’s original paper on information theory, researchers have
tried to develop structured codes that are easy to encode and decode.
In Section 7.11, we discuss Hamming codes, the simplest of a class of
algebraic error correcting codes that can correct one error in a block of
bits. Since Shannon’s paper, a variety of techniques have been used to
construct error correcting codes, and with turbo codes have come close
to achieving capacity for Gaussian channels.

7.8 ZERO-ERROR CODES 

The outline of the proof of the converse is most clearly motivated by
going through the argument when absolutely no errors are allowed. We
will now prove that $P_e^{(n)} = 0$ implies that $R \le C$. Assume that we have a
$(2^{nR}, n)$ code with zero probability of error [i.e., the decoder output $g(Y^n)$
is equal to the input index $W$ with probability 1]. Then the input index $W$
is determined by the output sequence [i.e., $H(W \mid Y^n) = 0$]. Now, to obtain
a strong bound, we arbitrarily assume that $W$ is uniformly distributed
over $\{1, 2, \ldots, 2^{nR}\}$. Thus, $H(W) = nR$. We can now write the string of
inequalities:

$$nR = H(W) = \underbrace{H(W \mid Y^n)}_{=0} + I(W; Y^n) \tag{7.82}$$

$$= I(W; Y^n) \tag{7.83}$$

$$\stackrel{(a)}{\le} I(X^n; Y^n) \tag{7.84}$$

$$\stackrel{(b)}{\le} \sum_{i=1}^{n} I(X_i; Y_i) \tag{7.85}$$

$$\stackrel{(c)}{\le} nC, \tag{7.86}$$

where (a) follows from the data-processing inequality (since $W \to X^n(W) \to Y^n$
forms a Markov chain), (b) will be proved in Lemma 7.9.2 using
the discrete memoryless assumption, and (c) follows from the definition
of (information) capacity. Hence, for any zero-error $(2^{nR}, n)$ code, for
all $n$,

$$R \le C. \tag{7.87}$$

7.9 FANO’S INEQUALITY AND THE CONVERSE
TO THE CODING THEOREM

We now extend the proof that was derived for zero-error codes to the case 
of codes with very small probabilities of error. The new ingredient will be 
Fano’s inequality, which gives a lower bound on the probability of error 
in terms of the conditional entropy. Recall the proof of Fano’s inequality, 
which is repeated here in a new context for reference. 

Let us define the setup under consideration. The index $W$ is uniformly
distributed on the set $\mathcal{W} = \{1, 2, \ldots, 2^{nR}\}$, and the sequence $Y^n$ is related
probabilistically to $W$. From $Y^n$, we estimate the index $W$ that was sent.
Let the estimate be $\hat{W} = g(Y^n)$. Thus, $W \to X^n(W) \to Y^n \to \hat{W}$ forms
a Markov chain. Note that the probability of error is

$$\Pr(\hat{W} \neq W) = \frac{1}{2^{nR}} \sum_{i} \lambda_i = P_e^{(n)}. \tag{7.88}$$

We begin with the following lemma, which has been proved in
Section 2.10:

Lemma 7.9.1 (Fano’s inequality) For a discrete memoryless channel
with a codebook $\mathcal{C}$ and the input message $W$ uniformly distributed over
$\{1, 2, \ldots, 2^{nR}\}$, we have

$$H(W \mid \hat{W}) \le 1 + P_e^{(n)} nR. \tag{7.89}$$

Proof: Since $W$ is uniformly distributed, we have $P_e^{(n)} = \Pr(\hat{W} \neq W)$.
We apply Fano’s inequality (Theorem 2.10.1) for $W$ in an alphabet of
size $2^{nR}$. □

We will now prove a lemma which shows that the capacity per transmission
is not increased if we use a discrete memoryless channel many
times.

Lemma 7.9.2 Let $Y^n$ be the result of passing $X^n$ through a discrete
memoryless channel of capacity $C$. Then

$$I(X^n; Y^n) \le nC \quad \text{for all } p(x^n). \tag{7.90}$$

Proof:

$$I(X^n; Y^n) = H(Y^n) - H(Y^n \mid X^n) \tag{7.91}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid Y_1, \ldots, Y_{i-1}, X^n) \tag{7.92}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i), \tag{7.93}$$

since by the definition of a discrete memoryless channel, $Y_i$ depends only
on $X_i$ and is conditionally independent of everything else. Continuing the
series of inequalities, we have

$$I(X^n; Y^n) = H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.94}$$

$$\le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.95}$$

$$= \sum_{i=1}^{n} I(X_i; Y_i) \tag{7.96}$$

$$\le nC, \tag{7.97}$$

where (7.95) follows from the fact that the entropy of a collection of random
variables is no greater than the sum of their individual entropies, and (7.97)
follows from the definition of capacity. Thus, we have proved that using the
channel many times does not increase the information capacity in bits per
transmission. □
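Lemma 7.9.2 can be checked numerically on a small case. The sketch below assumes a binary symmetric channel with crossover probability 0.1 used twice, and compares an i.i.d. uniform input with a fully correlated (repetition) input; the function name `block_mutual_information` and the two input distributions are choices made for this illustration.

```python
from itertools import product
from math import log2

def block_mutual_information(p_xn, W, n):
    """I(X^n; Y^n) in bits when X^n ~ p_xn is sent over n uses of the
    memoryless channel with transition matrix W[x][y]."""
    outputs = list(product(range(len(W[0])), repeat=n))
    joint = {}
    for xn, px in p_xn.items():
        for yn in outputs:
            pyx = 1.0
            for x, y in zip(xn, yn):
                pyx *= W[x][y]            # memoryless: product of per-use terms
            joint[(xn, yn)] = px * pyx
    p_yn = {yn: sum(joint[(xn, yn)] for xn in p_xn) for yn in outputs}
    return sum(pxy * log2(pxy / (p_xn[xn] * p_yn[yn]))
               for (xn, yn), pxy in joint.items() if pxy > 0)

p = 0.1
W = [[1 - p, p], [p, 1 - p]]                       # BSC(p)
C = 1 + p * log2(p) + (1 - p) * log2(1 - p)        # capacity 1 - H(p)

iid_uniform = {xn: 0.25 for xn in product((0, 1), repeat=2)}
repetition = {(0, 0): 0.5, (1, 1): 0.5}            # fully correlated inputs

I_iid = block_mutual_information(iid_uniform, W, 2)
I_rep = block_mutual_information(repetition, W, 2)
```

The i.i.d. uniform input meets the bound $nC$ with equality, while the correlated input falls strictly below it, consistent with the equality condition in (7.95) that the $Y_i$ be independent.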

We are now in a position to prove the converse to the channel coding 
theorem. 

Proof: Converse to Theorem 7.7.1 (Channel coding theorem). We have
to show that any sequence of $(2^{nR}, n)$ codes with $\lambda^{(n)} \to 0$ must have $R \le C$.
If the maximal probability of error tends to zero, the average probability
of error for the sequence of codes also goes to zero [i.e., $\lambda^{(n)} \to 0$ implies
$P_e^{(n)} \to 0$, where $P_e^{(n)}$ is defined in (7.32)]. For a fixed encoding rule
$X^n(\cdot)$ and a fixed decoding rule $\hat{W} = g(Y^n)$, we have $W \to X^n(W) \to Y^n \to \hat{W}$.
For each $n$, let $W$ be drawn according to a uniform distribution
over $\{1, 2, \ldots, 2^{nR}\}$. Since $W$ has a uniform distribution,
$\Pr(\hat{W} \neq W) = P_e^{(n)} = \frac{1}{2^{nR}} \sum_i \lambda_i$. Hence,

$$nR \stackrel{(a)}{=} H(W) \tag{7.98}$$

$$\stackrel{(b)}{=} H(W \mid \hat{W}) + I(W; \hat{W}) \tag{7.99}$$

$$\stackrel{(c)}{\le} 1 + P_e^{(n)} nR + I(W; \hat{W}) \tag{7.100}$$

$$\stackrel{(d)}{\le} 1 + P_e^{(n)} nR + I(X^n; Y^n) \tag{7.101}$$

$$\stackrel{(e)}{\le} 1 + P_e^{(n)} nR + nC, \tag{7.102}$$

where (a) follows from the assumption that $W$ is uniform over $\{1, 2, \ldots, 2^{nR}\}$,
(b) is an identity, (c) is Fano’s inequality for $W$ taking on at most $2^{nR}$
values, (d) is the data-processing inequality, and (e) is from Lemma 7.9.2.
Dividing by $n$, we obtain

$$R \le P_e^{(n)} R + \frac{1}{n} + C. \tag{7.103}$$

Now letting $n \to \infty$, we see that the first two terms on the right-hand
side tend to 0, and hence

$$R \le C. \tag{7.104}$$

We can rewrite (7.103) as

$$P_e^{(n)} \ge 1 - \frac{C}{R} - \frac{1}{nR}. \tag{7.105}$$

This equation shows that if $R > C$, the probability of error is bounded
away from 0 for sufficiently large $n$ (and hence for all $n$, since if $P_e^{(n)} = 0$
for small $n$, we can construct codes for large $n$ with $P_e^{(n)} = 0$ by concatenating
these codes). Hence, we cannot achieve an arbitrarily low
probability of error at rates above capacity. □
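As a small worked example of (7.105), the lower bound can be evaluated for specific rates. The helper name `weak_converse_bound` and the sample numbers are assumptions for illustration; the bound is clipped at 0 because (7.105) becomes vacuous when its right-hand side is negative.

```python
def weak_converse_bound(R, C, n):
    """Lower bound (7.105) on the average probability of error,
    clipped at 0 since a probability cannot be negative."""
    return max(0.0, 1.0 - C / R - 1.0 / (n * R))
```

With $C = 0.5$ and $R = 1$, the bound is $1 - 0.5 - 0.1 = 0.4$ at $n = 10$ and approaches $1 - C/R = 0.5$ as $n$ grows. For any rate below capacity the clipped bound is 0, so the inequality says nothing there.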

This converse is sometimes called the weak converse to the channel
coding theorem. It is also possible to prove a strong converse, which states
that for rates above capacity, the probability of error goes exponentially
to 1. Hence, the capacity is a very clear dividing point: at rates below
capacity, $P_e^{(n)} \to 0$ exponentially, and at rates above capacity, $P_e^{(n)} \to 1$
exponentially.

7.10 EQUALITY IN THE CONVERSE TO THE CHANNEL 
CODING THEOREM 

We have proved the channel coding theorem and its converse. In essence,
these theorems state that when $R < C$, it is possible to send information
with an arbitrarily low probability of error, and when $R > C$, the
probability of error is bounded away from zero.

It is interesting and rewarding to examine the consequences of equality
in the converse; hopefully, it will give some ideas as to the kinds of codes
that achieve capacity. Repeating the steps of the converse in the case when
$P_e^{(n)} = 0$, we have

$$nR = H(W) \tag{7.106}$$

$$= H(W \mid \hat{W}) + I(W; \hat{W}) \tag{7.107}$$

$$= I(W; \hat{W}) \tag{7.108}$$

$$\stackrel{(a)}{\le} I(X^n(W); Y^n) \tag{7.109}$$

$$= H(Y^n) - H(Y^n \mid X^n) \tag{7.110}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.111}$$

$$\stackrel{(b)}{\le} \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.112}$$

$$= \sum_{i=1}^{n} I(X_i; Y_i) \tag{7.113}$$

$$\stackrel{(c)}{\le} nC. \tag{7.114}$$

We have equality in (a), the data-processing inequality, only if
$I(Y^n; X^n(W) \mid W) = 0$ and $I(X^n; Y^n \mid \hat{W}) = 0$, which is true if all the codewords
are distinct and if $\hat{W}$ is a sufficient statistic for decoding. We have equality
in (b) only if the $Y_i$’s are independent, and equality in (c) only if the
distribution of $X_i$ is $p^*(x)$, the distribution on $X$ that achieves capacity.
We have equality in the converse only if these conditions are satisfied. This
indicates that a capacity-achieving zero-error code has distinct codewords
and the distribution of the $Y_i$’s must be i.i.d. with

$$p^*(y) = \sum_{x} p^*(x) p(y \mid x), \tag{7.115}$$

the distribution on $Y$ induced by the optimum distribution on $X$. The
distribution referred to in the converse is the empirical distribution on $X$
and $Y$ induced by a uniform distribution over codewords, that is,

$$p(x_i, y_i) = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} I(X_i(w) = x_i) \, p(y_i \mid x_i). \tag{7.116}$$

We can check this result in examples of codes that achieve capacity:

1. Noisy typewriter. In this case we have an input alphabet of 26 letters,
and each letter is either printed out correctly or changed to the
next letter with probability $\frac{1}{2}$. A simple code that achieves capacity
($\log 13$) for this channel is to use every alternate input letter so that
no two letters can be confused. In this case, there are 13 codewords
of block length 1. If we choose the codewords i.i.d. according to a
uniform distribution on $\{1, 3, 5, 7, \ldots, 25\}$, the output of the channel
is also i.i.d. and uniformly distributed on $\{1, 2, \ldots, 26\}$, as expected.

2. Binary symmetric channel. Since given any input sequence, every
possible output sequence has some positive probability, it will not
be possible to distinguish even two codewords with zero probability
of error. Hence the zero-error capacity of the BSC is zero. However,
even in this case, we can draw some useful conclusions. The
efficient codes will still induce a distribution on $Y$ that looks i.i.d.
$\sim$ Bernoulli($\frac{1}{2}$). Also, from the arguments that lead up to the converse,
we can see that at rates close to capacity, we have almost
entirely covered the set of possible output sequences with decoding
sets corresponding to the codewords. At rates above capacity, the
decoding sets begin to overlap, and the probability of error can no
longer be made arbitrarily small.
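The noisy typewriter example can be verified exhaustively. The sketch below assumes the letters are numbered 1 to 26 with letter 26 wrapping around to letter 1; the names `channel_outputs` and `decode` are hypothetical and only serve this check.

```python
from fractions import Fraction

LETTERS = range(1, 27)                  # letters numbered 1..26
CODEWORDS = range(1, 27, 2)             # every alternate letter: 1, 3, ..., 25

def channel_outputs(x):
    """The two equally likely outputs for input letter x (26 wraps to 1)."""
    return [x, x % 26 + 1]

def decode(y):
    """Odd outputs arrived unchanged; an even output y came from y - 1."""
    return y if y % 2 == 1 else y - 1

# Output distribution induced by a uniform choice among the 13 codewords.
output_dist = {y: Fraction(0) for y in LETTERS}
for x in CODEWORDS:
    for y in channel_outputs(x):
        output_dist[y] += Fraction(1, 13) * Fraction(1, 2)
```

Every one of the 26 possible (input, output) pairs decodes correctly, and each output letter has probability exactly $\frac{1}{26}$, matching the uniform output distribution required by (7.115).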

7.11 HAMMING CODES 

The channel coding theorem promises the existence of block codes that 
will allow us to transmit information at rates below capacity with an 
arbitrarily small probability of error if the block length is large enough. 
Ever since the appearance of Shannon’s original paper [471], people have 
searched for such codes. In addition to achieving low probabilities of 
error, useful codes should be “simple,” so that they can be encoded and 
decoded efficiently. 

The search for simple good codes has come a long way since the publication
of Shannon’s original paper in 1948. The entire field of coding
theory has been developed during this search. We will not be able to
describe the many elegant and intricate coding schemes that have been
developed since 1948. We will only describe the simplest such scheme
developed by Hamming [266]. It illustrates some of the basic ideas underlying
most codes.

The object of coding is to introduce redundancy so that even if some
of the information is lost or corrupted, it will still be possible to recover
the message at the receiver. The most obvious coding scheme is to repeat
information. For example, to send a 1, we send 11111, and to send a 0, we
send 00000. This scheme uses five symbols to send 1 bit, and therefore
has a rate of $\frac{1}{5}$ bit per symbol. If this code is used on a binary symmetric
channel, the optimum decoding scheme is to take the majority vote of
each block of five received bits. If three or more bits are 1, we decode
the block as a 1; otherwise, we decode it as 0. An error occurs if and
only if three or more of the bits are changed. By using longer repetition
codes, we can achieve an arbitrarily low probability of error. But the rate
of the code also goes to zero with block length, so even though the code
is “simple,” it is really not a very useful code.
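The trade-off described above can be made concrete by computing the block error probability of majority-vote decoding. The sketch assumes a binary symmetric channel with crossover probability `p` and an odd block length `n`; the function name is a hypothetical helper.

```python
from math import comb

def repetition_error_probability(n, p):
    """Probability that majority-vote decoding of an n-fold repetition
    code (n odd) fails on a BSC(p), i.e., more than n/2 bits are flipped."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))
```

For $n = 5$ and $p = 0.1$ this gives $10(0.1)^3(0.9)^2 + 5(0.1)^4(0.9) + (0.1)^5 = 0.00856$ at rate $\frac{1}{5}$. Increasing $n$ lowers the error probability further, but the rate $\frac{1}{n}$ shrinks with it.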

Instead of simply repeating the bits, we can combine the bits in some
intelligent fashion so that each extra bit checks whether there is an error in
some subset of the information bits. A simple example of this is a parity
check code. Starting with a block of $n - 1$ information bits, we choose
the $n$th bit so that the parity of the entire block is 0 (the number of 1’s
in the block is even). Then if there is an odd number of errors during
the transmission, the receiver will notice that the parity has changed and
detect the error. This is the simplest example of an error-detecting code.
The code does not detect an even number of errors and does not give any
information about how to correct the errors that occur.

We can extend the idea of parity checks to allow for more than one 
parity check bit and to allow the parity checks to depend on various subsets 
of the information bits. The Hamming code that we describe below is an 
example of a parity check code. We describe it using some simple ideas 
from linear algebra. 

To illustrate the principles of Hamming codes, we consider a binary
code of block length 7. All operations will be done modulo 2. Consider
the set of all nonzero binary vectors of length 3. Arrange them in columns
to form a matrix:

$$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}. \tag{7.117}$$



Consider the set of vectors of length 7 in the null space of $H$ (the vectors
which when multiplied by $H$ give 000). From the theory of linear spaces,
since $H$ has rank 3, we expect the null space of $H$ to have dimension 4.
These $2^4$ codewords are

0000000    0100101    1000011    1100110
0001111    0101010    1001100    1101001
0010110    0110011    1010101    1110000
0011001    0111100    1011010    1111111

Since the set of codewords is the null space of a matrix, it is linear in the
sense that the sum of any two codewords is also a codeword. The set of
codewords therefore forms a linear subspace of dimension 4 in the vector
space of dimension 7.

Looking at the codewords, we notice that other than the all-0 codeword,
the minimum number of 1’s in any codeword is 3. This is called the
minimum weight of the code. We can see that the minimum weight of
the code has to be at least 3 since all the columns of $H$ are different, so
no two columns can add to 000. The fact that the minimum distance is
exactly 3 can be seen from the fact that the sum of any two columns must
be one of the columns of the matrix.

Since the code is linear, the difference between any two codewords is 
also a codeword, and hence any two codewords differ in at least three 
places. The minimum number of places in which two codewords differ is 
called the minimum distance of the code. The minimum distance of the 
code is a measure of how far apart the codewords are and will determine 
how distinguishable the codewords will be at the output of the channel. 
The minimum distance is equal to the minimum weight for a linear code. 
We aim to develop codes that have a large minimum distance. 
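The codeword list and the weight and distance claims can be reproduced by brute force. The sketch below enumerates all $2^7$ binary vectors and keeps those with zero syndrome under the matrix $H$ of (7.117); the helper name `syndrome` is an assumption of this example.

```python
from itertools import product

H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def syndrome(H, v):
    """H v over GF(2), returned as a tuple of bits."""
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

# The code is the null space of H: all length-7 vectors with zero syndrome.
codewords = [v for v in product((0, 1), repeat=7)
             if syndrome(H, v) == (0, 0, 0)]

# Minimum weight over nonzero codewords, and minimum pairwise distance.
min_weight = min(sum(c) for c in codewords if any(c))
min_distance = min(sum(a != b for a, b in zip(c1, c2))
                   for c1 in codewords for c2 in codewords if c1 != c2)
```

The enumeration yields 16 codewords, and both the minimum weight and the minimum distance come out as 3, in agreement with the statement that the two coincide for a linear code.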

For the code described above, the minimum distance is 3. Hence if a 
codeword c is corrupted in only one place, it will differ from any other 
codeword in at least two places and therefore be closer to c than to 
any other codeword. But can we discover which is the closest codeword 
without searching over all the codewords? 

The answer is yes. We can use the structure of the matrix $H$ for decoding.
The matrix $H$, called the parity check matrix, has the property that
for every codeword $c$, $Hc = 0$. Let $e_i$ be a vector with a 1 in the $i$th
position and 0’s elsewhere. If the codeword is corrupted at position $i$, the
received vector is $r = c + e_i$. If we multiply this vector by the matrix $H$,
we obtain

$$Hr = H(c + e_i) = Hc + He_i = He_i, \tag{7.118}$$

which is the vector corresponding to the $i$th column of $H$. Hence looking
at $Hr$, we can find which position of the vector was corrupted. Reversing
this bit will give us a codeword. This yields a simple procedure for
correcting one error in the received sequence. We have constructed a codebook
with 16 codewords of block length 7, which can correct up to one
error. This code is called a Hamming code.
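The correction procedure based on (7.118) can be sketched in a few lines. The function name `correct_single_error` is hypothetical. The shortcut of reading the syndrome as a binary number relies on the particular column ordering of $H$ in (7.117), where column $i$ is $i$ written in binary with the top row as the most significant bit.

```python
H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

def correct_single_error(r):
    """Return r with at most one bit flipped so that H r = 0 (mod 2)."""
    s = [sum(h * x for h, x in zip(row, r)) % 2 for row in H]
    # Column i of this H is i in binary (top row = most significant bit),
    # so the syndrome read as a binary number is the error position.
    pos = 4 * s[0] + 2 * s[1] + s[2]
    c = list(r)
    if pos:                        # pos == 0 means the syndrome is zero
        c[pos - 1] ^= 1
    return c
```

For example, flipping any single bit of the codeword 0100101 and passing the result to `correct_single_error` returns 0100101 again. If two or more bits are flipped, the procedure still outputs a codeword, but not the one that was sent.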

We have not yet identified a simple encoding procedure; we could use
any mapping from a set of 16 messages into the codewords. But if we
examine the first 4 bits of the codewords in the table, we observe that
they cycle through all $2^4$ combinations of 4 bits. Thus, we could use
these 4 bits to be the 4 bits of the message we want to send; the other
3 bits are then determined by the code. In general, it is possible to modify
a linear code so that the mapping is explicit, so that the first $k$ bits in each
codeword represent the message, and the last $n - k$ bits are parity check
bits. Such a code is called a systematic code. The code is often identified
by its block length $n$, the number of information bits $k$ and the minimum
distance $d$. For example, the above code is called a (7,4,3) Hamming code
(i.e., $n = 7$, $k = 4$, and $d = 3$).
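One way to make the systematic encoding explicit is to solve $Hc = 0$ for the last three bits in terms of the first four. With $H$ as in (7.117), the three rows give $c_4 + c_5 + c_6 + c_7 = 0$, $c_2 + c_3 + c_6 + c_7 = 0$, and $c_1 + c_3 + c_5 + c_7 = 0$ (mod 2), which rearrange to the parity equations used below. This derivation and the function name `hamming74_encode` are a sketch added for illustration, not a procedure given in the text.

```python
def hamming74_encode(m):
    """Map 4 message bits to the codeword of the (7,4,3) code whose
    first 4 bits equal the message."""
    c1, c2, c3, c4 = m
    c5 = (c2 + c3 + c4) % 2        # from rows 1 and 2 of H c = 0
    c6 = (c1 + c3 + c4) % 2        # from rows 1 and 3 of H c = 0
    c7 = (c1 + c2 + c4) % 2        # from row 2 after substituting c6
    return [c1, c2, c3, c4, c5, c6, c7]
```

For instance, the message 1101 is encoded as 1101001, which appears in the codeword table.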

An easy way to see how Hamming codes work is by means of a Venn
diagram. Consider the following Venn diagram with three circles and with
four intersection regions as shown in Figure 7.10. To send the information
sequence 1101, we place the 4 information bits in the four intersection
regions as shown in the figure. We then place a parity bit in each of the
three remaining regions so that the parity of each circle is even (i.e., there
are an even number of 1’s in each circle). Thus, the parity bits are as
shown in Figure 7.11.

Now assume that one of the bits is changed; for example one of the 
information bits is changed from 1 to 0 as shown in Figure 7.12. Then 
the parity constraints are violated for two of the circles (highlighted in the 
figure), and it is not hard to see that given these violations, the only single 
bit error that could have caused it is at the intersection of the two circles 
(i.e., the bit that was changed). Similarly working through the other error 
cases, it is not hard to see that this code can detect and correct any single 
bit error in the received codeword. 

We can easily generalize this procedure to construct larger matrices
$H$. In general, if we use $l$ rows in $H$, the code that we obtain will have
block length $n = 2^l - 1$, $k = 2^l - l - 1$ and minimum distance 3. All
these codes are called Hamming codes and can correct one error.
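The general construction can be sketched as follows. The column ordering (column $j$ is $j$ written in binary) is an assumption chosen to match (7.117); any ordering of the nonzero columns gives an equivalent code. Both function names are hypothetical.

```python
def hamming_parity_check(l):
    """l x (2^l - 1) matrix whose columns are the nonzero binary vectors
    of length l; column j (1-based) is j in binary, top row most significant."""
    n = 2**l - 1
    return [[(j >> (l - 1 - row)) & 1 for j in range(1, n + 1)]
            for row in range(l)]

def hamming_parameters(l):
    """(n, k, d) for the Hamming code with l parity checks."""
    return (2**l - 1, 2**l - l - 1, 3)
```

With $l = 3$ this reproduces the matrix in (7.117) and the parameters (7,4,3); with $l = 4$ it gives a (15,11,3) code.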



FIGURE 7.10. Venn diagram with information bits. 











FIGURE 7.11. Venn diagram with information bits and parity bits with even parity for each 
circle. 



FIGURE 7.12. Venn diagram with one of the information bits changed. 


Hamming codes are the simplest examples of linear parity check codes.
They demonstrate the principle that underlies the construction of other
linear codes. But with large block lengths it is likely that there will be
more than one error in the block. In 1960, Reed and Solomon
found a class of multiple error-correcting codes for nonbinary channels.
Around the same time, Bose and Ray-Chaudhuri [72] and Hocquenghem [278]
generalized the ideas of Hamming codes using Galois field theory to construct
$t$-error correcting codes (called BCH codes) for any $t$. Since then,
various authors have developed other codes and also developed efficient
decoding algorithms for these codes. With the advent of integrated circuits,
it has become feasible to implement fairly complex codes in hardware and
realize some of the error-correcting performance promised by Shannon’s
channel capacity theorem. For example, all compact disc players include
error-correction circuitry based on two interleaved (32, 28, 5) and (28, 24,
5) Reed-Solomon codes that allow the decoder to correct bursts of up to
4000 errors.

All the codes described above are block codes — they map a block of 
information bits onto a channel codeword and there is no dependence on 
past information bits. It is also possible to design codes where each output 
block depends not only on the current input block, but also on some of 
the past inputs as well. A highly structured form of such a code is called 
a convolutional code. The theory of convolutional codes has developed 
considerably over the last 40 years. We will not go into the details, but 
refer the interested reader to textbooks on coding theory [69, 356]. 

For many years, none of the known coding algorithms came close
to achieving the promise of Shannon’s channel capacity theorem. For a
binary symmetric channel with crossover probability $p$, we would need a
code that could correct up to $np$ errors in a block of length $n$ and have
$n(1 - H(p))$ information bits. For example, the repetition code suggested
earlier corrects up to $n/2$ errors in a block of length $n$, but its rate goes
to 0 with $n$. Until 1972, all known codes that could correct $n\alpha$ errors for
block length $n$ had asymptotic rate 0. In 1972, Justesen [301] described
a class of codes with positive asymptotic rate and positive asymptotic
minimum distance as a fraction of the block length.

In 1993, a paper by Berrou et al. [57] introduced the notion that the
combination of two interleaved convolutional codes with a parallel cooperative
decoder achieved much better performance than any of the earlier
codes. Each decoder feeds its “opinion” of the value of each bit to the
other decoder and uses the opinion of the other decoder to help it decide
the value of the bit. This iterative process is repeated until both decoders
agree on the value of the bit. The surprising fact is that this iterative
procedure allows for efficient decoding at rates close to capacity for a
variety of channels. There has also been a renewed interest in the theory
of low-density parity check (LDPC) codes that were introduced by Robert
Gallager in his thesis [231, 232]. In 1997, MacKay and Neal [368] showed
that an iterative message-passing algorithm similar to the algorithm used
for decoding turbo codes could achieve rates close to capacity with high
probability for LDPC codes. Both turbo codes and LDPC codes remain
active areas of research and have been applied to wireless and satellite
communication channels.



FIGURE 7.13. Discrete memoryless channel with feedback. 


7.12 FEEDBACK CAPACITY 

A channel with feedback is illustrated in Figure 7.13. We assume that
all the received symbols are sent back immediately and noiselessly to
the transmitter, which can then use them to decide which symbol to send
next. Can we do better with feedback? The surprising answer is no, which
we shall now prove. We define a $(2^{nR}, n)$ feedback code as a sequence
of mappings $x_i(W, Y^{i-1})$, where each $x_i$ is a function only of the message
$W \in \{1, 2, \ldots, 2^{nR}\}$ and the previous received values, $Y_1, Y_2, \ldots, Y_{i-1}$, and a
sequence of decoding functions $g : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR}\}$. Thus,

$$P_e^{(n)} = \Pr\{g(Y^n) \neq W\}, \tag{7.119}$$

when $W$ is uniformly distributed over $\{1, 2, \ldots, 2^{nR}\}$.

Definition The capacity with feedback, $C_{FB}$, of a discrete memoryless
channel is the supremum of all rates achievable by feedback codes.

Theorem 7.12.1 (Feedback capacity)

$$C_{FB} = C = \max_{p(x)} I(X; Y). \tag{7.120}$$

Proof: Since a nonfeedback code is a special case of a feedback code,
any rate that can be achieved without feedback can be achieved with
feedback, and hence

$$C_{FB} \ge C. \tag{7.121}$$

Proving the inequality the other way is slightly more tricky. We cannot
use the same proof that we used for the converse to the coding theorem
without feedback. Lemma 7.9.2 is no longer true, since $X_i$ depends on
the past received symbols, and it is no longer true that $Y_i$ depends only
on $X_i$ and is conditionally independent of the future $X$’s in (7.93).

There is a simple change that will fix the problem with the proof.
Instead of using $X^n$, we will use the index $W$ and prove a similar series
of inequalities. Let $W$ be uniformly distributed over $\{1, 2, \ldots, 2^{nR}\}$. Then
$\Pr(\hat{W} \neq W) = P_e^{(n)}$ and

$$nR = H(W) = H(W \mid \hat{W}) + I(W; \hat{W}) \tag{7.122}$$

$$\le 1 + P_e^{(n)} nR + I(W; \hat{W}) \tag{7.123}$$

$$\le 1 + P_e^{(n)} nR + I(W; Y^n), \tag{7.124}$$

by Fano’s inequality and the data-processing inequality. Now we can
bound $I(W; Y^n)$ as follows:

$$I(W; Y^n) = H(Y^n) - H(Y^n \mid W) \tag{7.125}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid Y_1, Y_2, \ldots, Y_{i-1}, W) \tag{7.126}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid Y_1, Y_2, \ldots, Y_{i-1}, W, X_i) \tag{7.127}$$

$$= H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i), \tag{7.128}$$

since $X_i$ is a function of $Y_1, \ldots, Y_{i-1}$ and $W$; and conditional on $X_i$, $Y_i$
is independent of $W$ and past samples of $Y$. Continuing, we have

$$I(W; Y^n) = H(Y^n) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.129}$$

$$\le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i \mid X_i) \tag{7.130}$$

$$= \sum_{i=1}^{n} I(X_i; Y_i) \tag{7.131}$$

$$\le nC \tag{7.132}$$

from the definition of capacity for a discrete memoryless channel. Putting
these together, we obtain

$$nR \le P_e^{(n)} nR + 1 + nC, \tag{7.133}$$

and dividing by $n$ and letting $n \to \infty$, we conclude that

$$R \le C. \tag{7.134}$$

Thus, we cannot achieve any higher rates with feedback than we can
without feedback, and

$$C_{FB} = C. \tag{7.135}$$ □

As we have seen in the example of the binary erasure channel, feedback 
can help enormously in simplifying encoding and decoding. However, it 
cannot increase the capacity of the channel. 
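The simplification that feedback brings for the erasure channel can be seen in a short simulation. The sketch below assumes a binary erasure channel with erasure probability `alpha` less than 1 and a transmitter that simply repeats each bit until it arrives unerased; the function name, the seed, and the parameter values are choices made for this illustration.

```python
import random

def send_with_feedback(bits, alpha, rng):
    """Send bits over a binary erasure channel with erasure probability
    alpha < 1, repeating each bit until it gets through (the transmitter
    learns of erasures via feedback). Returns (received bits, channel uses)."""
    received, uses = [], 0
    for b in bits:
        while True:
            uses += 1
            if rng.random() >= alpha:       # symbol was not erased
                received.append(b)
                break
    return received, uses

rng = random.Random(0)                       # fixed seed for repeatability
bits = [rng.randrange(2) for _ in range(100_000)]
received, uses = send_with_feedback(bits, alpha=0.25, rng=rng)
rate = len(bits) / uses                      # empirical bits per channel use
```

Every bit is delivered correctly with no coding at all, and the empirical rate is close to $1 - \alpha$, the capacity of the erasure channel. The scheme reaches capacity with trivial encoding and decoding, but it does not exceed it.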

7.13 SOURCE-CHANNEL SEPARATION THEOREM 

It is now time to combine the two main results that we have proved so far: 
data compression (R > H: Theorem 5.4.2) and data transmission (R < 
C: Theorem 7.7.1). Is the condition H < C necessary and sufficient for 
sending a source over a channel? For example, consider sending digitized 
speech or music over a discrete memoryless channel. We could design 
a code to map the sequence of speech samples directly into the input 
of the channel, or we could compress the speech into its most efficient 
representation, then use the appropriate channel code to send it over the 
channel. It is not immediately clear that we are not losing something 
by using the two-stage method, since data compression does not depend 
on the channel and the channel coding does not depend on the source 
distribution. 

We will prove in this section that the two-stage method is as good as
any other method of transmitting information over a noisy channel. This
result has some important practical implications. It implies that we can
consider the design of a communication system as a combination of two
parts, source coding and channel coding. We can design source codes
for the most efficient representation of the data. We can, separately and
independently, design channel codes appropriate for the channel. The combination
will be as efficient as anything we could design by considering
both problems together.

The common representation for all kinds of data uses a binary alphabet.
Most modern communication systems are digital, and data are reduced
to a binary representation for transmission over the common channel.
This offers an enormous reduction in complexity. Networks like ATM
networks and the Internet use the common binary representation to allow
speech, video, and digital data to use the same communication channel.

The result, that a two-stage process is as good as any one-stage
process, seems so obvious that it may be appropriate to point out that it
is not always true. There are examples of multiuser channels where the 
decomposition breaks down. We also consider two simple situations where 
the theorem appears to be misleading. A simple example is that of sending 
English text over an erasure channel. We can look for the most efficient 
binary representation of the text and send it over the channel. But the 
errors will be very difficult to decode. If, however, we send the English 
text directly over the channel, we can lose up to about half the letters and 
yet be able to make sense out of the message. Similarly, the human ear has 
some unusual properties that enable it to distinguish speech under very 
high noise levels if the noise is white. In such cases, it may be appropriate 
to send the uncompressed speech over the noisy channel rather than the 
compressed version. Apparently, the redundancy in the source is suited to 
the channel. 

Let us define the setup under consideration. We have a source $V$ that
generates symbols from an alphabet $\mathcal{V}$. We will not make any assumptions
about the kind of stochastic process produced by V other than that it is 
from a finite alphabet and satisfies the AEP. Examples of such processes 
include a sequence of i.i.d. random variables and the sequence of states 
of a stationary irreducible Markov chain. Any stationary ergodic source 
satisfies the AEP, as we show in Section 16.8. 

We want to send the sequence of symbols $V^n = V_1, V_2, \ldots, V_n$ over
the channel so that the receiver can reconstruct the sequence. To do this,
we map the sequence onto a codeword $X^n(V^n)$ and send the codeword
over the channel. The receiver looks at his received sequence $Y^n$ and
makes an estimate $\hat{V}^n$ of the sequence $V^n$ that was sent. The receiver
makes an error if $V^n \neq \hat{V}^n$. We define the probability of error as

$\Pr(V^n \neq \hat{V}^n) = \sum_{y^n} \sum_{v^n} p(v^n)\, p(y^n \mid x^n(v^n))\, I(g(y^n) \neq v^n)$,    (7.136)

where $I$ is the indicator function and $g(y^n)$ is the decoding function. The
system is illustrated in Figure 7.14.

We can now state the joint source-channel coding theorem: 


FIGURE 7.14. Joint source and channel coding.

Theorem 7.13.1 (Source-channel coding theorem) If $V_1, V_2, \ldots, V_n$
is a finite alphabet stochastic process that satisfies the AEP and
$H(\mathcal{V}) < C$, there exists a source-channel code with probability of error
$\Pr(\hat{V}^n \neq V^n) \to 0$. Conversely, for any stationary stochastic process,
if $H(\mathcal{V}) > C$, the probability of error is bounded away from zero, and it
is not possible to send the process over the channel with arbitrarily low
probability of error.


Proof: Achievability. The essence of the forward part of the proof is 
the two-stage encoding described earlier. Since we have assumed that the 
stochastic process satisfies the AEP, it implies that there exists a typical
set $A_\epsilon^{(n)}$ of size $\le 2^{n(H(\mathcal{V})+\epsilon)}$ which contains most of the probability. We
will encode only the source sequences belonging to the typical set; all
other sequences will result in an error. This will contribute at most $\epsilon$ to
the probability of error.

We index all the sequences belonging to $A_\epsilon^{(n)}$. Since there are at most
$2^{n(H+\epsilon)}$ such sequences, $n(H + \epsilon)$ bits suffice to index them. We can
transmit the desired index to the receiver with probability of error less
than $\epsilon$ if

$H(\mathcal{V}) + \epsilon = R < C$.    (7.137)

The receiver can reconstruct $V^n$ by enumerating the typical set $A_\epsilon^{(n)}$
and choosing the sequence corresponding to the estimated index. This
sequence will agree with the transmitted sequence with high probability.
To be precise,

$P(V^n \neq \hat{V}^n) \le P(V^n \notin A_\epsilon^{(n)}) + P(g(Y^n) \neq V^n \mid V^n \in A_\epsilon^{(n)})$    (7.138)
$\le \epsilon + \epsilon = 2\epsilon$    (7.139)

for n sufficiently large. Hence, we can reconstruct the sequence with low 
probability of error for n sufficiently large if 

$H(\mathcal{V}) < C$.    (7.140)
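
As a rough numerical illustration of conditions (7.137) and (7.140), the following sketch assumes a Bernoulli(alpha) source sent over a binary symmetric channel with crossover probability p, with one channel use per source symbol. The function names and the particular parameter values are illustrative assumptions, not part of the text.

```python
import math

def binary_entropy(q):
    """Binary entropy H(q) in bits; H(0) = H(1) = 0 by convention."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def separation_check(alpha, p, eps, n):
    """Return (entropy rate, capacity, index bits, feasible) for a
    Bernoulli(alpha) source over a BSC(p), one channel use per symbol."""
    h = binary_entropy(alpha)               # entropy rate of the i.i.d. source
    c = 1.0 - binary_entropy(p)             # capacity of the BSC
    index_bits = math.ceil(n * (h + eps))   # bits that index the typical set
    feasible = h + eps < c                  # condition (7.137): R = H + eps < C
    return h, c, index_bits, feasible

h, c, bits, ok = separation_check(alpha=0.1, p=0.05, eps=0.05, n=1000)
print(f"H = {h:.4f}, C = {c:.4f}, index bits = {bits}, feasible = {ok}")
```

The check only confirms that a rate R = H + eps fits below capacity; it says nothing about how large n must be for the two error terms in (7.138) to fall below eps.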

Converse: We wish to show that $\Pr(\hat{V}^n \neq V^n) \to 0$ implies that
$H(\mathcal{V}) \le C$ for any sequence of source-channel codes

$X^n(V^n) : \mathcal{V}^n \to \mathcal{X}^n$,    (7.141)
$g_n(Y^n) : \mathcal{Y}^n \to \mathcal{V}^n$.    (7.142)

Thus $X^n(\cdot)$ is an arbitrary (perhaps random) assignment of codewords
to data sequences $V^n$, and $g_n(\cdot)$ is any decoding function (assignment of
estimates $\hat{V}^n$ to output sequences $Y^n$). By Fano's inequality, we must have

$H(V^n \mid \hat{V}^n) \le 1 + \Pr(\hat{V}^n \neq V^n) \log |\mathcal{V}^n| = 1 + \Pr(\hat{V}^n \neq V^n)\, n \log |\mathcal{V}|$.    (7.143)

Hence for the code,

$H(\mathcal{V}) \overset{(a)}{\le} \frac{H(V_1, V_2, \ldots, V_n)}{n}$    (7.144)
$= \frac{H(V^n)}{n}$    (7.145)
$= \frac{1}{n} H(V^n \mid \hat{V}^n) + \frac{1}{n} I(V^n; \hat{V}^n)$    (7.146)
$\overset{(b)}{\le} \frac{1}{n} \left(1 + \Pr(\hat{V}^n \neq V^n)\, n \log |\mathcal{V}|\right) + \frac{1}{n} I(V^n; \hat{V}^n)$    (7.147)
$\overset{(c)}{\le} \frac{1}{n} \left(1 + \Pr(\hat{V}^n \neq V^n)\, n \log |\mathcal{V}|\right) + \frac{1}{n} I(X^n; Y^n)$    (7.148)
$\overset{(d)}{\le} \frac{1}{n} + \Pr(\hat{V}^n \neq V^n) \log |\mathcal{V}| + C$,    (7.149)


where (a) follows from the definition of entropy rate of a stationary
process, (b) follows from Fano's inequality, (c) follows from the
data-processing inequality (since $V^n \to X^n \to Y^n \to \hat{V}^n$ forms a Markov
chain) and (d) follows from the memorylessness of the channel. Now
letting $n \to \infty$, we have $\Pr(\hat{V}^n \neq V^n) \to 0$ and hence

$H(\mathcal{V}) \le C$.    (7.150)

□ 
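
Rearranging (7.149) gives an explicit lower bound on the error probability, $\Pr(\hat{V}^n \neq V^n) \ge (H(\mathcal{V}) - C - 1/n) / \log |\mathcal{V}|$, which stays positive for large n whenever the entropy rate exceeds capacity. A minimal sketch of that rearrangement follows; the function name and the sample numbers are assumptions made for illustration.

```python
import math

def error_lower_bound(h_rate, capacity, alphabet_size, n):
    """Lower bound on Pr(V-hat^n != V^n) obtained by solving (7.149)
    for the error probability. Rates are in bits per symbol; the
    result is clipped to the interval [0, 1]."""
    bound = (h_rate - capacity - 1.0 / n) / math.log2(alphabet_size)
    return min(1.0, max(0.0, bound))

# Hypothetical numbers: a binary source with entropy rate 1 bit
# over a channel of capacity 0.5 bit, block length 100.
print(error_lower_bound(h_rate=1.0, capacity=0.5, alphabet_size=2, n=100))
```

When the entropy rate is below capacity the bound is vacuous (zero), which is consistent with the achievability half of the theorem.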

Hence, we can transmit a stationary ergodic source over a channel if and 
only if its entropy rate is less than the capacity of the channel. The joint 
source-channel separation theorem enables us to consider the problem of 
source coding separately from the problem of channel coding. The source 
coder tries to find the most efficient representation of the source, and 
the channel coder encodes the message to combat the noise and errors 
introduced by the channel. The separation theorem says that the separate 
encoders (Figure 7.15) can achieve the same rates as the joint encoder 
(Figure 7.14). 

FIGURE 7.15. Separate source and channel coding. [Figure: $V^n$ → Source Encoder → Channel Encoder → $X^n(V^n)$ → Channel $p(y|x)$ → Channel Decoder → Source Decoder → $\hat{V}^n$]

With this result, we have tied together the two basic theorems of
information theory: data compression and data transmission. We will try
to summarize the proofs of the two results in a few words. The data
compression theorem is a consequence of the AEP, which shows that
there exists a “small” subset (of size $2^{nH}$) of all possible source sequences
that contains most of the probability and that we can therefore represent
the source with a small probability of error using $H$ bits per symbol.
The data transmission theorem is based on the joint AEP; it uses the
fact that for long block lengths, the output sequence of the channel is
very likely to be jointly typical with the input codeword, while any other
codeword is jointly typical with probability $\approx 2^{-nI}$. Hence, we can use
about $2^{nI}$ codewords and still have negligible probability of error. The
source-channel separation theorem shows that we can design the source
code and the channel code separately and combine the results to achieve
optimal performance.


SUMMARY 

Channel capacity. The logarithm of the number of distinguishable 
inputs is given by 

$C = \max_{p(x)} I(X; Y)$.


Examples 

• Binary symmetric channel: $C = 1 - H(p)$.

• Binary erasure channel: $C = 1 - \alpha$.

• Symmetric channel: $C = \log |\mathcal{Y}| - H(\text{row of transition matrix})$.
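
The three closed forms above can be evaluated directly. The sketch below is a minimal illustration with capacities in bits; the helper names are assumptions chosen for this example.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

def bsc_capacity(p):
    """C = 1 - H(p) for a binary symmetric channel with crossover p."""
    return 1.0 - entropy([p, 1.0 - p])

def bec_capacity(alpha):
    """C = 1 - alpha for a binary erasure channel with erasure prob alpha."""
    return 1.0 - alpha

def symmetric_capacity(row, num_outputs):
    """C = log|Y| - H(row) for a symmetric channel whose transition
    matrix rows are permutations of `row`."""
    return math.log2(num_outputs) - entropy(row)

print(bsc_capacity(0.1), bec_capacity(0.25), symmetric_capacity([0.5, 0.5, 0, 0], 4))
```

A binary symmetric channel is itself a symmetric channel, so `symmetric_capacity([1 - p, p], 2)` should agree with `bsc_capacity(p)`.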

Properties of C 


1. $0 \le C \le \min\{\log |\mathcal{X}|, \log |\mathcal{Y}|\}$.

2. $I(X; Y)$ is a continuous concave function of $p(x)$.


Joint typicality. The set $A_\epsilon^{(n)}$ of jointly typical sequences $\{(x^n, y^n)\}$
with respect to the distribution $p(x, y)$ is given by

$A_\epsilon^{(n)} = \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n :$    (7.151)
$\left| -\frac{1}{n} \log p(x^n) - H(X) \right| < \epsilon$,    (7.152)
$\left| -\frac{1}{n} \log p(y^n) - H(Y) \right| < \epsilon$,    (7.153)
$\left| -\frac{1}{n} \log p(x^n, y^n) - H(X, Y) \right| < \epsilon \Big\}$,    (7.154)

where $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$.

Joint AEP. Let $(X^n, Y^n)$ be sequences of length $n$ drawn i.i.d. according
to $p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$. Then:

1. $\Pr((X^n, Y^n) \in A_\epsilon^{(n)}) \to 1$ as $n \to \infty$.

2. $|A_\epsilon^{(n)}| \le 2^{n(H(X,Y)+\epsilon)}$.

3. If $(\tilde{X}^n, \tilde{Y}^n) \sim p(x^n)p(y^n)$, then
$\Pr((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) \le 2^{-n(I(X;Y)-3\epsilon)}$.
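
The membership test defined by conditions (7.152) to (7.154) can be written out directly. The following is a minimal sketch with base-2 logarithms; the function name, the dictionary representation of p(x, y), and the example sequences are assumptions made for illustration.

```python
import math

def is_jointly_typical(xs, ys, pxy, eps):
    """Test conditions (7.152)-(7.154) for two equal-length sequences.

    pxy maps (x, y) pairs to joint probabilities p(x, y); the marginals
    are derived from it. All logarithms are base 2.
    """
    n = len(xs)
    px, py = {}, {}
    for (x, y), q in pxy.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q

    def entropy(dist):
        return -sum(q * math.log2(q) for q in dist.values() if q > 0)

    def sample_rate(seq, dist):
        # -(1/n) log p(seq) under the i.i.d. product distribution;
        # a zero-probability symbol makes the sequence atypical.
        if any(dist.get(s, 0.0) <= 0 for s in seq):
            return math.inf
        return -sum(math.log2(dist[s]) for s in seq) / n

    pairs = list(zip(xs, ys))
    return (abs(sample_rate(xs, px) - entropy(px)) < eps
            and abs(sample_rate(ys, py) - entropy(py)) < eps
            and abs(sample_rate(pairs, pxy) - entropy(pxy)) < eps)

# Hypothetical example: uniform input to a BSC with crossover 0.1.
bsc_joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
x_seq = [0] * 25
y_seq = [1, 1] + [0] * 23   # differs from x_seq in two places
print(is_jointly_typical(x_seq, y_seq, bsc_joint, eps=0.2))
```

Checking a single pair this way is cheap; it is the enumeration of all typical pairs that grows as $2^{n(H(X,Y)+\epsilon)}$.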


Channel coding theorem. All rates below capacity $C$ are achievable,
and all rates above capacity are not; that is, for all rates $R < C$, there
exists a sequence of $(2^{nR}, n)$ codes with probability of error $\lambda^{(n)} \to 0$.
Conversely, for rates $R > C$, $\lambda^{(n)}$ is bounded away from 0.

Feedback capacity. Feedback does not increase capacity for discrete
memoryless channels (i.e., $C_{FB} = C$).

Source-channel theorem. A stochastic process with entropy rate H 
cannot be sent reliably over a discrete memoryless channel if H > 
C. Conversely, if the process satisfies the AEP, the source can be 
transmitted reliably if H < C. 


PROBLEMS 


7.1 Preprocessing the output. One is given a communication channel
with transition probabilities $p(y|x)$ and channel capacity
$C = \max_{p(x)} I(X; Y)$. A helpful statistician preprocesses the output by
forming $\tilde{Y} = g(Y)$. He claims that this will strictly improve the
capacity.

(a) Show that he is wrong.

(b) Under what conditions does he not strictly decrease the
capacity?

7.2 Additive noise channel. Find the channel capacity of the following
discrete memoryless channel:

[Figure: $X$ → (+) → $Y$, with noise $Z$ entering the adder]

where $\Pr\{Z = 0\} = \Pr\{Z = a\} = \frac{1}{2}$. The alphabet for $x$ is $\mathcal{X} =
\{0, 1\}$. Assume that $Z$ is independent of $X$. Observe that the channel
capacity depends on the value of $a$.

7.3 Channels with memory have higher capacity. Consider a binary
symmetric channel with $Y_i = X_i \oplus Z_i$, where $\oplus$ is mod 2 addition,
and $X_i, Y_i \in \{0, 1\}$. Suppose that $\{Z_i\}$ has constant marginal
probabilities $\Pr\{Z_i = 1\} = p = 1 - \Pr\{Z_i = 0\}$, but that $Z_1, Z_2,
\ldots, Z_n$ are not necessarily independent. Assume that $Z^n$ is
independent of the input $X^n$. Let $C = 1 - H(p, 1 - p)$. Show that
$\max_{p(x_1, x_2, \ldots, x_n)} I(X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n) \ge nC$.

7.4 Channel capacity. Consider the discrete memoryless channel $Y =
X + Z \pmod{11}$, where

$Z = \begin{pmatrix} 1 & 2 & 3 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{pmatrix}$

and $X \in \{0, 1, \ldots, 10\}$. Assume that $Z$ is independent of $X$.

(a) Find the capacity.

(b) What is the maximizing $p^*(x)$?

7.5 Using two channels at once. Consider two discrete memoryless
channels $(\mathcal{X}_1, p(y_1 \mid x_1), \mathcal{Y}_1)$ and $(\mathcal{X}_2, p(y_2 \mid x_2), \mathcal{Y}_2)$ with
capacities $C_1$ and $C_2$, respectively. A new channel $(\mathcal{X}_1 \times \mathcal{X}_2, p(y_1 \mid
x_1) \times p(y_2 \mid x_2), \mathcal{Y}_1 \times \mathcal{Y}_2)$ is formed in which $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$
are sent simultaneously, resulting in $y_1, y_2$. Find the capacity of this
channel.

7.6 Noisy typewriter. Consider a 26-key typewriter.

(a) If pushing a key results in printing the associated letter, what
is the capacity $C$ in bits?

(b) Now suppose that pushing a key results in printing that letter or
the next (with equal probability). Thus, A → A or B, ..., Z →
Z or A. What is the capacity?

(c) What is the highest rate code with block length one that you
can find that achieves zero probability of error for the channel
in part (b)?

7.7 Cascade of binary symmetric channels. Show that a cascade of $n$
identical independent binary symmetric channels,

$X_0 \to \boxed{\text{BSC}} \to X_1 \to \cdots \to X_{n-1} \to \boxed{\text{BSC}} \to X_n$,

each with raw error probability $p$, is equivalent to a single BSC with
error probability $\frac{1}{2}(1 - (1 - 2p)^n)$ and hence that
$\lim_{n \to \infty} I(X_0; X_n) = 0$ if $p \neq 0, 1$. No encoding or decoding takes
place at the intermediate terminals $X_1, \ldots, X_{n-1}$. Thus, the capacity
of the cascade tends to zero.
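
Before attempting the proof, the claimed closed form can be checked numerically. The sketch below composes the channels one stage at a time and compares the result with $\frac{1}{2}(1 - (1 - 2p)^n)$; it is a sanity check under assumed function names, not a substitute for the argument the problem asks for.

```python
def cascade_crossover(p, n):
    """Crossover probability of n cascaded BSC(p), by direct recursion."""
    q = 0.0  # with zero channels the bit is never flipped
    for _ in range(n):
        # flipped overall = (flipped so far, not flipped now)
        #                 + (not flipped so far, flipped now)
        q = q * (1 - p) + (1 - q) * p
    return q

def closed_form(p, n):
    """The expression (1/2)(1 - (1 - 2p)^n) stated in the problem."""
    return 0.5 * (1 - (1 - 2 * p) ** n)

for n in (1, 2, 5, 50):
    print(n, cascade_crossover(0.1, n), closed_form(0.1, n))
```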

7.8 Z-channel. The Z-channel has binary input and output alphabets
and transition probabilities $p(y|x)$ given by the following matrix:

$Q = \begin{bmatrix} 1 & 0 \\ 1/2 & 1/2 \end{bmatrix}, \qquad x, y \in \{0, 1\}$

Find the capacity of the Z-channel and the maximizing input
probability distribution.

7.9 Suboptimal codes. For the Z-channel of Problem 7.8, assume that
we choose a $(2^{nR}, n)$ code at random, where each codeword is a
sequence of fair coin tosses. This will not achieve capacity. Find the
maximum rate $R$ such that the probability of error $P_e^{(n)}$, averaged
over the randomly generated codes, tends to zero as the block length
$n$ tends to infinity.

7.10 Zero-error capacity. A channel with alphabet {0, 1, 2, 3, 4} has
transition probabilities of the form

$p(y|x) = \begin{cases} 1/2 & \text{if } y = x \pm 1 \bmod 5 \\ 0 & \text{otherwise.} \end{cases}$

(a) Compute the capacity of this channel in bits.

(b) The zero-error capacity of a channel is the number of bits per
channel use that can be transmitted with zero probability of
error. Clearly, the zero-error capacity of this pentagonal channel
is at least 1 bit (transmit 0 or 1 with probability 1/2). Find
a block code that shows that the zero-error capacity is greater
than 1 bit. Can you estimate the exact value of the zero-error
capacity? (Hint: Consider codes of length 2 for this channel.)
The zero-error capacity of this channel was finally found by
Lovasz [365].

7.11 Time-varying channels. Consider a time-varying discrete
memoryless channel.

Let $Y_1, Y_2, \ldots, Y_n$ be conditionally independent given $X_1, X_2,
\ldots, X_n$, with conditional distribution given by
$p(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} p_i(y_i \mid x_i)$.
Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)$, $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$. Find
$\max_{p(\mathbf{x})} I(\mathbf{X}; \mathbf{Y})$.



7.12 Unused symbols. Show that the capacity of the channel with
probability transition matrix

$P_{y|x} = \begin{bmatrix} \frac{2}{3} & \frac{1}{3} & 0 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix}$    (7.155)

is achieved by a distribution that places zero probability on one
of the input symbols. What is the capacity of this channel? Give an
intuitive reason why that letter is not used.

7.13 Erasures and errors in a binary channel. Consider a channel with
binary inputs that has both erasures and errors. Let the probability
of error be $\epsilon$ and the probability of erasure be $\alpha$, so the channel is
as follows:

[Figure: binary-input channel with outputs 0, e, 1; each input is received correctly with probability $1 - \alpha - \epsilon$, erased with probability $\alpha$, and flipped with probability $\epsilon$]

(a) Find the capacity of this channel.

(b) Specialize to the case of the binary symmetric channel ($\alpha = 0$).

(c) Specialize to the case of the binary erasure channel ($\epsilon = 0$).

7.14 Channels with dependence between the letters. Consider the
following channel over a binary alphabet that takes in 2-bit symbols
and produces a 2-bit output, as determined by the following
mapping: 00 → 01, 01 → 10, 10 → 11, and 11 → 00. Thus, if the
2-bit sequence 01 is the input to the channel, the output is 10 with
probability 1. Let $X_1, X_2$ denote the two input symbols and $Y_1, Y_2$
denote the corresponding output symbols.

(a) Calculate the mutual information $I(X_1, X_2; Y_1, Y_2)$ as a
function of the input distribution on the four possible pairs of inputs.

(b) Show that the capacity of a pair of transmissions on this
channel is 2 bits.

(c) Show that under the maximizing input distribution, $I(X_1; Y_1)
= 0$. Thus, the distribution on the input sequences that achieves
capacity does not necessarily maximize the mutual information
between individual symbols and their corresponding outputs.

7.15 Jointly typical sequences. As we did in Problem 3.13 for the typical
set for a single random variable, we will calculate the jointly typical
set for a pair of random variables connected by a binary symmetric
channel, and the probability of error for jointly typical decoding
for such a channel.

[Figure: binary symmetric channel; each input is received correctly with probability 0.9 and flipped with probability 0.1]

We consider a binary symmetric channel with crossover probability
0.1. The input distribution that achieves capacity is the uniform
distribution [i.e., $p(x) = (\frac{1}{2}, \frac{1}{2})$], which yields the joint distribution
$p(x, y)$ for this channel given by

    X \ Y      0       1
      0      0.45    0.05
      1      0.05    0.45

The marginal distribution of $Y$ is also $(\frac{1}{2}, \frac{1}{2})$.

(a) Calculate H(X) ， H(Y), H(X, Y), and I(X; Y) for the joint 
distribution above. 

(b) Let $X_1, X_2, \ldots, X_n$ be drawn i.i.d. according to the Bernoulli($\frac{1}{2}$)
distribution. Of the $2^n$ possible input sequences of length $n$,
which of them are typical [i.e., member of $A_\epsilon^{(n)}(X)$ for $\epsilon =
0.2$]? Which are the typical sequences in $A_\epsilon^{(n)}(Y)$?

(c) The jointly typical set $A_\epsilon^{(n)}(X, Y)$ is defined as the set of
sequences that satisfy equations (7.35-7.37). The first two
equations correspond to the conditions that $x^n$ and $y^n$ are in
$A_\epsilon^{(n)}(X)$ and $A_\epsilon^{(n)}(Y)$, respectively. Consider the last
condition, which can be rewritten to state that $-\frac{1}{n} \log p(x^n, y^n) \in
(H(X, Y) - \epsilon, H(X, Y) + \epsilon)$. Let $k$ be the number of places
in which the sequence $x^n$ differs from $y^n$ ($k$ is a function of
the two sequences). Then we can write

$p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i)$    (7.156)
$= (0.45)^{n-k} (0.05)^k$    (7.157)
$= \left(\frac{1}{2}\right)^n (1 - p)^{n-k} p^k$.    (7.158)

An alternative way of looking at this probability is to look at the
binary symmetric channel as an additive channel $Y = X \oplus Z$,
where $Z$ is a binary random variable that is equal to 1 with
probability $p$, and is independent of $X$. In this case,

$p(x^n, y^n) = p(x^n) p(y^n \mid x^n)$    (7.159)
$= p(x^n) p(z^n \mid x^n)$    (7.160)
$= p(x^n) p(z^n)$    (7.161)
$= \left(\frac{1}{2}\right)^n (1 - p)^{n-k} p^k$.    (7.162)

Show that the condition that $(x^n, y^n)$ is jointly typical is
equivalent to the condition that $x^n$ is typical and $z^n = y^n - x^n$
is typical.

(d) We now calculate the size of $A_\epsilon^{(n)}(Z)$ for $n = 25$ and $\epsilon = 0.2$.
As in Problem 3.13, here is a table of the probabilities and
numbers of sequences with $k$ ones:

     k    $\binom{n}{k}$    $\binom{n}{k} p^k (1-p)^{n-k}$    $-\frac{1}{n} \log p(x^n)$
     0          1      0.071790      0.152003
     1         25      0.199416      0.278800
     2        300      0.265888      0.405597
     3       2300      0.226497      0.532394
     4      12650      0.138415      0.659191
     5      53130      0.064594      0.785988
     6     177100      0.023924      0.912785
     7     480700      0.007215      1.039582
     8    1081575      0.001804      1.166379
     9    2042975      0.000379      1.293176
    10    3268760      0.000067      1.419973
    11    4457400      0.000010      1.546770
    12    5200300      0.000001      1.673567

[Sequences with more than 12 ones are omitted since their total
probability is negligible (and they are not in the typical set).]
What is the size of the set $A_\epsilon^{(n)}(Z)$?
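
The entries of the table in part (d) follow from the binomial coefficient and the probability of a single sequence with $k$ ones. The sketch below regenerates them for $n = 25$ and $p = 0.1$ with base-2 logarithms; the function name is an assumption, and the code only tabulates the rows, leaving the typicality question to the reader.

```python
import math

def typicality_table(n, p, k_max):
    """Rows (k, C(n,k), C(n,k) p^k (1-p)^(n-k), -(1/n) log2 p(z^n))
    for sequences of length n with k ones, k = 0..k_max."""
    rows = []
    for k in range(k_max + 1):
        count = math.comb(n, k)                 # number of sequences with k ones
        seq_prob = p**k * (1 - p)**(n - k)      # probability of one such sequence
        rows.append((k, count, count * seq_prob, -math.log2(seq_prob) / n))
    return rows

for k, count, prob, rate in typicality_table(n=25, p=0.1, k_max=12):
    print(f"{k:2d} {count:8d} {prob:.6f} {rate:.6f}")
```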

(e) Now consider random coding for the channel, as in the proof
of the channel coding theorem. Assume that $2^{nR}$ codewords
$X^n(1), X^n(2), \ldots, X^n(2^{nR})$ are chosen uniformly over the $2^n$
possible binary sequences of length $n$. One of these codewords
is chosen and sent over the channel. The receiver looks at
the received sequence and tries to find a codeword in the
code that is jointly typical with the received sequence. As
argued above, this corresponds to finding a codeword $X^n(i)$
such that $Y^n - X^n(i) \in A_\epsilon^{(n)}(Z)$. For a fixed codeword $x^n(i)$,
what is the probability that the received sequence $Y^n$ is such
that $(x^n(i), Y^n)$ is jointly typical?

(f) Now consider a particular received sequence $y^n =
000000 \ldots 0$, say. Assume that we choose a sequence
$X^n$ at random, uniformly distributed among all the $2^n$ possible
binary $n$-sequences. What is the probability that the chosen
sequence is jointly typical with this $y^n$? [Hint: This is the
probability of all sequences $x^n$ such that $y^n - x^n \in A_\epsilon^{(n)}(Z)$.]

(g) Now consider a code with $2^9 = 512$ codewords of length 25
chosen at random, uniformly distributed among all the $2^n$
sequences of length $n = 25$. One of these codewords, say
the one corresponding to i = 1, is chosen and sent over the 
channel. As calculated in part (e), the received sequence, with 
high probability, is jointly typical with the codeword that was 
sent. What is the probability that one or more of the other 
codewords (which were chosen at random, independent of the 
sent codeword) is jointly typical with the received sequence? 
[Hint: You could use the union bound, but you could also 
calculate this probability exactly, using the result of part (f) 
and the independence of the codewords.] 

(h) Given that a particular codeword was sent, the probability of
error (averaged over the probability distribution of the channel
and over the random choice of other codewords) can be
written as

$\Pr(\text{Error} \mid x^n(1) \text{ sent}) = \sum_{y^n :\, y^n \text{ causes error}} p(y^n \mid x^n(1))$.    (7.163)

There are two kinds of error: the first occurs if the received
sequence $y^n$ is not jointly typical with the transmitted codeword,
and the second occurs if there is another codeword
jointly typical with the received sequence. Using the result
of the preceding parts, calculate this probability of error. By
the symmetry of the random coding argument, this does not
depend on which codeword was sent.

The calculations above show that average probability of error for
a random code with 512 codewords of length 25 over the binary
symmetric channel of crossover probability 0.1 is about 0.34. This
seems quite high, but the reason for this is that the value of $\epsilon$ that
we have chosen is too large. By choosing a smaller $\epsilon$ and a larger
$n$ in the definitions of $A_\epsilon^{(n)}$, we can get the probability of error to
be as small as we want as long as the rate of the code is less than
$I(X; Y) - 3\epsilon$.

Also note that the decoding procedure described in the problem 
is not optimal. The optimal decoding procedure is maximum like¬ 
lihood (i.e., to choose the codeword that is closest to the received 
sequence). It is possible to calculate the average probability of 
error for a random code for which the decoding is based on an 
approximation to maximum likelihood decoding, where we decode 
a received sequence to the unique codeword that differs from the 
received sequence in $\le 4$ bits, and declare an error otherwise. The
only difference from the jointly typical decoding described above
arises in the case when the codeword is equal to the received
sequence! The average probability of error for this decoding scheme
can be shown to be about 0.285. 

7.16 Encoder and decoder as part of the channel. Consider a binary
symmetric channel with crossover probability 0.1. A possible coding
scheme for this channel with two codewords of length 3 is to
encode message $a_1$ as 000 and $a_2$ as 111. With this coding scheme,
we can consider the combination of encoder, channel, and decoder
as forming a new BSC, with two inputs $a_1$ and $a_2$ and two outputs
$a_1$ and $a_2$.

(a) Calculate the crossover probability of this channel. 

(b) What is the capacity of this channel in bits per transmission of 
the original channel? 

(c) What is the capacity of the original BSC with crossover
probability 0.1?

(d) Prove a general result that for any channel, considering the 
encoder, channel, and decoder together as a new channel from 
messages to estimated messages will not increase the capacity 
in bits per transmission of the original channel. 

7.17 Codes of length 3 for a BSC and BEC. In Problem 7.16, the
probability of error was calculated for a code with two codewords of
length 3 (000 and 111) sent over a binary symmetric channel with
crossover probability $\epsilon$. For this problem, take $\epsilon = 0.1$.

(a) Find the best code of length 3 with four codewords for this 
channel. What is the probability of error for this code? (Note 
that all possible received sequences should be mapped onto 
possible codewords.) 

(b) What is the probability of error if we used all eight possible 
sequences of length 3 as codewords? 

(c) Now consider a binary erasure channel with erasure probability 
0.1. Again, if we used the two-codeword code 000 and 111, 
received sequences 00E, 0E0, E00, 0EE, E0E, EE0 would all 
be decoded as 0, and similarly, we would decode 11E, 1E1,
E11, 1EE, E1E, EE1 as 1. If we received the sequence EEE,
we would not know if it was a 000 or a 111 that was sent — so 
we choose one of these two at random, and are wrong half the 
time. What is the probability of error for this code over the 
erasure channel? 

(d) What is the probability of error for the codes of parts (a) and 
(b) when used over the binary erasure channel? 

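7.18 Channel capacity. Calculate the capacity of the following channels
with probability transition matrices:

(a) $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$

$p(y|x) = \begin{bmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}$    (7.164)

(b) $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$

$p(y|x) = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \end{bmatrix}$    (7.165)

(c) $\mathcal{X} = \mathcal{Y} = \{0, 1, 2, 3\}$

$p(y|x) = \begin{bmatrix} p & 1 - p & 0 & 0 \\ 1 - p & p & 0 & 0 \\ 0 & 0 & q & 1 - q \\ 0 & 0 & 1 - q & q \end{bmatrix}$    (7.166)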

7.19 Capacity of the carrier pigeon channel. Consider a commander of
an army besieged in a fort for whom the only means of communication
to his allies is a set of carrier pigeons. Assume that each
carrier pigeon can carry one letter (8 bits), that pigeons are released 
once every 5 minutes, and that each pigeon takes exactly 3 minutes 
to reach its destination. 

(a) Assuming that all the pigeons reach safely, what is the capacity 
of this link in bits/hour? 

(b) Now assume that the enemies try to shoot down the pigeons 
and that they manage to hit a fraction a of them. Since the 
pigeons are sent at a constant rate, the receiver knows when 
the pigeons are missing. What is the capacity of this link? 

(c) Now assume that the enemy is more cunning and that every 
time they shoot down a pigeon, they send out a dummy pigeon 
carrying a random letter (chosen uniformly from all 8-bit letters).
What is the capacity of this link in bits/hour?

Set up an appropriate model for the channel in each of the above 
cases, and indicate how to go about finding the capacity. 

7.20 Channel with two independent looks at Y. Let $Y_1$ and $Y_2$ be
conditionally independent and conditionally identically distributed given
$X$.

(a) Show that $I(X; Y_1, Y_2) = 2I(X; Y_1) - I(Y_1; Y_2)$.

(b) Conclude that the capacity of the channel

$X \to (Y_1, Y_2)$

is less than twice the capacity of the channel

$X \to Y_1$


7.21 Tall, fat people. Suppose that the average height of people in a 
room is 5 feet. Suppose that the average weight is 100 lb. 

(a) Argue that no more than one-third of the population is 15 feet 
tall. 

(b) Find an upper bound on the fraction of 300-lb 10-footers in 
the room. 

7.22 Can signal alternatives lower capacity? Show that adding a row to
a channel transition matrix does not decrease capacity.

7.23 Binary multiplier channel

(a) Consider the channel $Y = XZ$, where $X$ and $Z$ are independent
binary random variables that take on values 0 and 1. $Z$ is
Bernoulli($\alpha$) [i.e., $P(Z = 1) = \alpha$]. Find the capacity of this
channel and the maximizing distribution on $X$.

(b) Now suppose that the receiver can observe $Z$ as well as $Y$.
What is the capacity?

7.24 Noise alphabets. Consider the channel

[Figure: $X$ → (+) → $Y$, with noise $Z$ entering the adder]

$\mathcal{X} = \{0, 1, 2, 3\}$, where $Y = X + Z$, and $Z$ is uniformly distributed
over three distinct integer values $\mathcal{Z} = \{z_1, z_2, z_3\}$.

(a) What is the maximum capacity over all choices of the $\mathcal{Z}$
alphabet? Give distinct integer values $z_1, z_2, z_3$ and a distribution on
$\mathcal{X}$ achieving this.

(b) What is the minimum capacity over all choices for the $\mathcal{Z}$
alphabet? Give distinct integer values $z_1, z_2, z_3$ and a distribution on
$\mathcal{X}$ achieving this.

7.25 Bottleneck channel. Suppose that a signal $X \in \mathcal{X} = \{1, 2, \ldots, m\}$
goes through an intervening transition $X \to V \to Y$:

[Figure: $X$ → $p(v|x)$ → $V$ → $p(y|v)$ → $Y$]

where $\mathcal{X} = \{1, 2, \ldots, m\}$, $\mathcal{Y} = \{1, 2, \ldots, m\}$, and $\mathcal{V} = \{1, 2, \ldots, k\}$.
Here $p(v|x)$ and $p(y|v)$ are arbitrary and the channel has transition
probability $p(y|x) = \sum_{v} p(v|x) p(y|v)$. Show that $C \le \log k$.

7.26 Noisy typewriter. Consider the channel with $x, y \in \{0, 1, 2, 3\}$ and
transition probabilities $p(y|x)$ given by the following matrix:

$\begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & 0 & \frac{1}{2} \end{bmatrix}$

(a) Find the capacity of this channel.

(b) Define the random variable $z = g(y)$, where

$g(y) = \begin{cases} A & \text{if } y \in \{0, 1\} \\ B & \text{if } y \in \{2, 3\}. \end{cases}$

For the following two PMFs for $x$, compute $I(X; Z)$:

(i) $p(x) = \begin{cases} \frac{1}{2} & \text{if } x \in \{1, 3\} \\ 0 & \text{if } x \in \{0, 2\}. \end{cases}$

(ii) $p(x) = \begin{cases} 0 & \text{if } x \in \{1, 3\} \\ \frac{1}{2} & \text{if } x \in \{0, 2\}. \end{cases}$

(c) Find the capacity of the channel between $x$ and $z$, specifically
where $x \in \{0, 1, 2, 3\}$, $z \in \{A, B\}$, and the transition
probabilities $P(z|x)$ are given by

$p(Z = z \mid X = x) = \sum_{y_0 :\, g(y_0) = z} P(Y = y_0 \mid X = x)$.

(d) For the $X$ distribution of part (i) of (b), does $X \to Z \to Y$
form a Markov chain?


7.27 Erasure channel. Let $\{\mathcal{X}, p(y|x), \mathcal{Y}\}$ be a discrete memoryless
channel with capacity $C$. Suppose that this channel is cascaded
immediately with an erasure channel $\{\mathcal{Y}, p(s|y), \mathcal{S}\}$ that erases $\alpha$ of its
symbols.

[Figure: $X$ → $p(y|x)$ → $Y$ → erasure channel → $S$]

Specifically, $\mathcal{S} = \{y_1, y_2, \ldots, y_m, e\}$, and

$\Pr\{S = y \mid X = x\} = (1 - \alpha)\, p(y|x), \quad y \in \mathcal{Y}$,

$\Pr\{S = e \mid X = x\} = \alpha$.

Determine the capacity of this channel.

7.28 Choice of channels. Find the capacity $C$ of the union of two
channels $(\mathcal{X}_1, p_1(y_1|x_1), \mathcal{Y}_1)$ and $(\mathcal{X}_2, p_2(y_2|x_2), \mathcal{Y}_2)$, where at each
time, one can send a symbol over channel 1 or channel 2 but
not both. Assume that the output alphabets are distinct and do not
intersect.

(a) Show that $2^C = 2^{C_1} + 2^{C_2}$. Thus, $2^C$ is the effective alphabet
size of a channel with capacity $C$.

(b) Compare with Problem 2.10 where $2^H = 2^{H_1} + 2^{H_2}$, and
interpret part (a) in terms of the effective number of noise-free
symbols.

(c) Use the above result to calculate the capacity of the following
channel.

[Figure: channel with a branch labeled $1 - p$ and a third symbol 2 that is received as 2]


7.29 Binary multiplier channel 

(a) Consider the discrete memoryless channel Y = XZ, where X 
and Z are independent binary random variables that take on 
values 0 and 1. Let $P(Z = 1) = \alpha$. Find the capacity of this
channel and the maximizing distribution on X. 

(b) Now suppose that the receiver can observe Z as well as Y. 
What is the capacity? 





7.30 Noise alphabets. Consider the channel

[Figure: $X$ → (+) → $Y$, with noise $Z$ entering the adder]

$\mathcal{X} = \{0, 1, 2, 3\}$, where $Y = X + Z$, and $Z$ is uniformly distributed
over three distinct integer values $\mathcal{Z} = \{z_1, z_2, z_3\}$.

(a) What is the maximum capacity over all choices of the $\mathcal{Z}$
alphabet? Give distinct integer values $z_1, z_2, z_3$ and a distribution on
$\mathcal{X}$ achieving this.

(b) What is the minimum capacity over all choices for the $\mathcal{Z}$
alphabet? Give distinct integer values $z_1, z_2, z_3$ and a distribution on
$\mathcal{X}$ achieving this.

7.31 Source and channel. We wish to encode a Bernoulli($\alpha$) process
$V_1, V_2, \ldots$ for transmission over a binary symmetric channel with
crossover probability $p$.

Find conditions on $\alpha$ and $p$ so that the probability of error
$P(\hat{V}^n \neq V^n)$ can be made to go to zero as $n \to \infty$.


7.32 Random 20 questions. Let $X$ be uniformly distributed over $\{1, 2,
\ldots, m\}$. Assume that $m = 2^n$. We ask random questions: Is $X \in S_1$?
Is $X \in S_2$? ... until only one integer remains. All $2^m$ subsets $S$ of
$\{1, 2, \ldots, m\}$ are equally likely.

(a) How many deterministic questions are needed to determine $X$?

(b) Without loss of generality, suppose that $X = 1$ is the random
object. What is the probability that object 2 yields the same
answers as object 1 for $k$ questions?

(c) What is the expected number of objects in $\{2, 3, \ldots, m\}$ that
have the same answers to the questions as those of the correct
object 1?

(d) Suppose that we ask $n + \sqrt{n}$ random questions. What is the
expected number of wrong objects agreeing with the answers?

(e) Use Markov's inequality $\Pr\{X \ge t\mu\} \le \frac{1}{t}$, to show that the
probability of error (one or more wrong object remaining) goes
to zero as $n \to \infty$.

7.33 BSC with feedback. Suppose that feedback is used on a binary
symmetric channel with parameter $p$. Each time a $Y$ is received,
it becomes the next transmission. Thus, $X_1$ is Bern($\frac{1}{2}$), $X_2 = Y_1$,
$X_3 = Y_2, \ldots, X_n = Y_{n-1}$.

(a) Find $\lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n)$.

(b) Show that for some values of $p$, this can be higher than
capacity.

(c) Using this feedback transmission scheme, $X^n(W, Y^n) = (X_1(W),
Y_1, Y_2, \ldots, Y_{n-1})$, what is the asymptotic communication
rate achieved; that is, what is $\lim_{n \to \infty} \frac{1}{n} I(W; Y^n)$?

7.34 Capacity. Find the capacity of
(a) Two parallel BSCs:

[Figure: channel diagram not reproduced]

(b) BSC and a single symbol:

[Figure: channel diagram with input X and output Y, symbols labeled 1, 2, 3]

(c) BSC and a ternary channel:

[Figure: channel diagram not reproduced]

(d) Ternary channel:

$$p(y|x) = \begin{bmatrix} \frac{2}{3} & \frac{1}{3} & 0 \\ 0 & \frac{1}{3} & \frac{2}{3} \end{bmatrix}. \tag{7.167}$$

7.35 Capacity. Suppose that channel $\mathcal{P}$ has capacity C, where $\mathcal{P}$ is an
$m \times n$ channel matrix.

(a) What is the capacity of

$$\tilde{\mathcal{P}} = \begin{bmatrix} \mathcal{P} & 0 \\ 0 & 1 \end{bmatrix}?$$

(b) What about the capacity of

$$\hat{\mathcal{P}} = \begin{bmatrix} \mathcal{P} & 0 \\ 0 & I_k \end{bmatrix}?$$

where $I_k$ is the $k \times k$ identity matrix.

7.36 Channel with memory. Consider the discrete memoryless channel
$Y_i = Z_i X_i$ with input alphabet $X_i \in \{-1, 1\}$.

(a) What is the capacity of this channel when $\{Z_i\}$ is i.i.d. with

$$Z_i = \begin{cases} 1, & p = 0.5 \\ -1, & p = 0.5? \end{cases} \tag{7.168}$$

Now consider the channel with memory. Before transmission
begins, Z is randomly chosen and fixed for all time. Thus,
$Y_i = Z X_i$.

(b) What is the capacity if

$$Z = \begin{cases} 1, & p = 0.5 \\ -1, & p = 0.5? \end{cases} \tag{7.169}$$


7.37 Joint typicality. Let $(X_i, Y_i, Z_i)$ be i.i.d. according to $p(x, y, z)$. We
will say that $(x^n, y^n, z^n)$ is jointly typical [written $(x^n, y^n, z^n) \in A_\epsilon^{(n)}$] if

• $p(x^n) \in 2^{-n(H(X) \pm \epsilon)}$.

• $p(y^n) \in 2^{-n(H(Y) \pm \epsilon)}$.

• $p(z^n) \in 2^{-n(H(Z) \pm \epsilon)}$.

• $p(x^n, y^n) \in 2^{-n(H(X,Y) \pm \epsilon)}$.

• $p(x^n, z^n) \in 2^{-n(H(X,Z) \pm \epsilon)}$.

• $p(y^n, z^n) \in 2^{-n(H(Y,Z) \pm \epsilon)}$.

• $p(x^n, y^n, z^n) \in 2^{-n(H(X,Y,Z) \pm \epsilon)}$.

Now suppose that $(\tilde{X}^n, \tilde{Y}^n, \tilde{Z}^n)$ is drawn according to $p(x^n)p(y^n)p(z^n)$.
Thus, $\tilde{X}^n, \tilde{Y}^n, \tilde{Z}^n$ have the same marginals as $p(x^n, y^n, z^n)$
but are independent. Find (bounds on) $\Pr\{(\tilde{X}^n, \tilde{Y}^n, \tilde{Z}^n) \in A_\epsilon^{(n)}\}$ in
terms of the entropies $H(X)$, $H(Y)$, $H(Z)$, $H(X, Y)$, $H(X, Z)$,
$H(Y, Z)$, and $H(X, Y, Z)$.


HISTORICAL NOTES 

The idea of mutual information and its relationship to channel capacity 
was developed by Shannon in his original paper [472]. In this paper, he 
stated the channel capacity theorem and outlined the proof using typical 
sequences in an argument similar to the one described here. The first 
rigorous proof was due to Feinstein [205], who used a painstaking
"cookie-cutting" argument to find the number of codewords that can be sent with a
low probability of error. A simpler proof using a random coding exponent 
was developed by Gallager [224]. Our proof is based on Cover [121] and 
on Forney’s unpublished course notes [216]. 

The converse was proved by Fano [201], who used the inequality bearing
his name. The strong converse was first proved by Wolfowitz [565],
using techniques that are closely related to typical sequences. An iterative
algorithm to calculate the channel capacity was developed independently
by Arimoto [25] and Blahut [65].

The idea of the zero-error capacity was developed by Shannon [474];
in the same paper, he also proved that feedback does not increase the
capacity of a discrete memoryless channel. The problem of finding the
zero-error capacity is essentially combinatorial; the first important result
in this area is due to Lovász [365]. The general problem of finding the
zero-error capacity is still open; see a survey of related results in Körner
and Orlitsky [327].

Quantum information theory, the quantum mechanical counterpart to 
the classical theory in this chapter, is emerging as a large research area in 
its own right and is well surveyed in an article by Bennett and Shor [49] 
and in the text by Nielsen and Chuang [395]. 



CHAPTER 8 


DIFFERENTIAL ENTROPY 


We now introduce the concept of differential entropy, which is the entropy 
of a continuous random variable. Differential entropy is also related to the 
shortest description length and is similar in many ways to the entropy of 
a discrete random variable. But there are some important differences, and 
there is need for some care in using the concept. 


8.1 DEFINITIONS 

Definition Let X be a random variable with cumulative distribution
function $F(x) = \Pr(X \le x)$. If $F(x)$ is continuous, the random variable
is said to be continuous. Let $f(x) = F'(x)$ when the derivative is defined.
If $\int_{-\infty}^{\infty} f(x) = 1$, $f(x)$ is called the probability density function for X. The
set where $f(x) > 0$ is called the support set of X.

Definition The differential entropy $h(X)$ of a continuous random variable
X with density $f(x)$ is defined as

$$h(X) = -\int_S f(x) \log f(x)\,dx, \tag{8.1}$$

where S is the support set of the random variable.

As in the discrete case, the differential entropy depends only on the 
probability density of the random variable, and therefore the differential 
entropy is sometimes written as h(f) rather than h(X). 

Remark As in every example involving an integral, or even a density,
we should include the statement "if it exists." It is easy to construct examples
of random variables for which a density function does not exist or for
which the above integral does not exist.

Example 8.1.1 (Uniform distribution) Consider a random variable
distributed uniformly from 0 to a so that its density is $1/a$ from 0 to a and 0
elsewhere. Then its differential entropy is

$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\,dx = \log a. \tag{8.2}$$

Note: For $a < 1$, $\log a < 0$, and the differential entropy is negative. Hence,
unlike discrete entropy, differential entropy can be negative. However,
$2^{h(X)} = 2^{\log a} = a$ is the volume of the support set, which is always
nonnegative, as we expect.
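
The uniform example can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `differential_entropy_bits` and `uniform_density` are illustrative, and the integral is approximated with a simple midpoint rule on the support.

```python
import math

def differential_entropy_bits(density, lo, hi, steps=10_000):
    """Midpoint-rule estimate of -integral f(x) log2 f(x) dx over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        fx = density(lo + (k + 0.5) * width)
        if fx > 0.0:  # points outside the support contribute nothing
            total -= fx * math.log2(fx) * width
    return total

def uniform_density(a):
    """Density of the uniform distribution on [0, a]."""
    return lambda x: 1.0 / a if 0.0 <= x <= a else 0.0

# a = 8 gives log2(8) = 3 bits; a = 1/8 gives log2(1/8) = -3 bits.
h_wide = differential_entropy_bits(uniform_density(8.0), 0.0, 8.0)
h_narrow = differential_entropy_bits(uniform_density(0.125), 0.0, 0.125)
print(h_wide, h_narrow)
```

The second value is negative, which illustrates that differential entropy, unlike discrete entropy, is not bounded below by zero.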


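Example 8.1.2 (Normal distribution) Let $X \sim \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}$.
Then calculating the differential entropy in nats, we obtain

\begin{align}
h(\phi) &= -\int \phi \ln \phi \tag{8.3} \\
&= -\int \phi(x) \left[ -\frac{x^2}{2\sigma^2} - \ln \sqrt{2\pi\sigma^2} \right] \tag{8.4} \\
&= \frac{EX^2}{2\sigma^2} + \frac{1}{2} \ln 2\pi\sigma^2 \tag{8.5} \\
&= \frac{1}{2} + \frac{1}{2} \ln 2\pi\sigma^2 \tag{8.6} \\
&= \frac{1}{2} \ln e + \frac{1}{2} \ln 2\pi\sigma^2 \tag{8.7} \\
&= \frac{1}{2} \ln 2\pi e \sigma^2 \quad \text{nats.} \tag{8.8}
\end{align}

Changing the base of the logarithm, we have

$$h(\phi) = \frac{1}{2} \log 2\pi e \sigma^2 \quad \text{bits.} \tag{8.9}$$


8.2 AEP FOR CONTINUOUS RANDOM VARIABLES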

One of the important roles of the entropy for discrete random variables
is in the AEP, which states that for a sequence of i.i.d. random variables,
$p(X_1, X_2, \ldots, X_n)$ is close to $2^{-nH(X)}$ with high probability. This enables
us to define the typical set and characterize the behavior of typical sequences.
We can do the same for a continuous random variable.

Theorem 8.2.1 Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables
drawn i.i.d. according to the density $f(x)$. Then

$$-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability.} \tag{8.10}$$

Proof: The proof follows directly from the weak law of large numbers. □


This leads to the following definition of the typical set.

Definition For $\epsilon > 0$ and any n, we define the typical set $A_\epsilon^{(n)}$ with
respect to $f(x)$ as follows:

$$A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\}, \tag{8.11}$$

where $f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(x_i)$.

The properties of the typical set for continuous random variables parallel
those for discrete random variables. The analog of the cardinality of
the typical set for the discrete case is the volume of the typical set for
continuous random variables.

Definition The volume $\mathrm{Vol}(A)$ of a set $A \subset \mathcal{R}^n$ is defined as

$$\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \cdots dx_n. \tag{8.12}$$


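Theorem 8.2.2 The typical set $A_\epsilon^{(n)}$ has the following properties:

1. $\Pr(A_\epsilon^{(n)}) > 1 - \epsilon$ for n sufficiently large.

2. $\mathrm{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}$ for all n.

3. $\mathrm{Vol}(A_\epsilon^{(n)}) \ge (1 - \epsilon) 2^{n(h(X)-\epsilon)}$ for n sufficiently large.

Proof: By Theorem 8.2.1, $-\frac{1}{n} \log f(X^n) = -\frac{1}{n} \sum \log f(X_i) \to h(X)$ in
probability, establishing property 1. Also,

\begin{align}
1 &= \int_{S^n} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n \tag{8.13} \\
&\ge \int_{A_\epsilon^{(n)}} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n \tag{8.14} \\
&\ge \int_{A_\epsilon^{(n)}} 2^{-n(h(X)+\epsilon)}\, dx_1\, dx_2 \cdots dx_n \tag{8.15} \\
&= 2^{-n(h(X)+\epsilon)} \int_{A_\epsilon^{(n)}} dx_1\, dx_2 \cdots dx_n \tag{8.16} \\
&= 2^{-n(h(X)+\epsilon)}\, \mathrm{Vol}(A_\epsilon^{(n)}). \tag{8.17}
\end{align}

Hence we have property 2. We argue further that the volume of the typical
set is at least this large. If n is sufficiently large so that property 1 is
satisfied, then

\begin{align}
1 - \epsilon &\le \int_{A_\epsilon^{(n)}} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n \tag{8.18} \\
&\le \int_{A_\epsilon^{(n)}} 2^{-n(h(X)-\epsilon)}\, dx_1\, dx_2 \cdots dx_n \tag{8.19} \\
&= 2^{-n(h(X)-\epsilon)} \int_{A_\epsilon^{(n)}} dx_1\, dx_2 \cdots dx_n \tag{8.20} \\
&= 2^{-n(h(X)-\epsilon)}\, \mathrm{Vol}(A_\epsilon^{(n)}), \tag{8.21}
\end{align}

establishing property 3. Thus for n sufficiently large, we have

$$(1 - \epsilon) 2^{n(h(X)-\epsilon)} \le \mathrm{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}. \tag{8.22}$$

□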


Theorem 8.2.3 The set $A_\epsilon^{(n)}$ is the smallest volume set with probability
$\ge 1 - \epsilon$, to first order in the exponent.

Proof: Same as in the discrete case. □

This theorem indicates that the volume of the smallest set that contains
most of the probability is approximately $2^{nh}$. This is an n-dimensional
volume, so the corresponding side length is $(2^{nh})^{1/n} = 2^h$. This provides
an interpretation of the differential entropy: It is the logarithm of the
equivalent side length of the smallest set that contains most of the
probability. Hence low entropy implies that the random variable is confined
to a small effective volume and high entropy indicates that the random
variable is widely dispersed.

Note. Just as the entropy is related to the volume of the typical set, there 
is a quantity called Fisher information which is related to the surface 
area of the typical set. We discuss Fisher information in more detail in 
Sections 11.10 and 17.8. 
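
The convergence asserted by the AEP for continuous random variables can be observed by simulation. The sketch below is illustrative only: it assumes a Gaussian density with standard deviation `sigma = 2`, a fixed random seed, and hypothetical helper names, and it compares $-\frac{1}{n} \log f(X^n)$ with the closed-form entropy $\frac{1}{2} \log 2\pi e \sigma^2$ from Example 8.1.2.

```python
import math
import random

def gaussian_log2_density(x, sigma):
    """log2 of the N(0, sigma^2) density evaluated at x."""
    ln_f = -0.5 * math.log(2.0 * math.pi * sigma ** 2) - x * x / (2.0 * sigma ** 2)
    return ln_f / math.log(2.0)

def empirical_entropy_rate(n, sigma, rng):
    """-(1/n) log2 f(X_1, ..., X_n) for one i.i.d. N(0, sigma^2) sample of length n."""
    total = sum(gaussian_log2_density(rng.gauss(0.0, sigma), sigma) for _ in range(n))
    return -total / n

sigma = 2.0
h_true = 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)  # entropy in bits
rng = random.Random(0)  # fixed seed so the run is repeatable
estimates = {n: empirical_entropy_rate(n, sigma, rng) for n in (10, 1_000, 100_000)}
for n, value in estimates.items():
    print(n, value, h_true)
```

As n grows, the sample value settles near `h_true`, which is the statement that long sequences fall in the typical set with high probability.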


8.3 RELATION OF DIFFERENTIAL ENTROPY TO DISCRETE 
ENTROPY 


Consider a random variable X with density $f(x)$ illustrated in Figure 8.1.
Suppose that we divide the range of X into bins of length $\Delta$. Let us
assume that the density is continuous within the bins. Then, by the mean
value theorem, there exists a value $x_i$ within each bin such that

$$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx. \tag{8.23}$$

Consider the quantized random variable $X^\Delta$, which is defined by

$$X^\Delta = x_i \quad \text{if } i\Delta \le X < (i+1)\Delta. \tag{8.24}$$



FIGURE 8.1. Quantization of a continuous random variable.

Then the probability that $X^\Delta = x_i$ is

$$p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx = f(x_i)\Delta. \tag{8.25}$$

The entropy of the quantized version is

\begin{align}
H(X^\Delta) &= -\sum_{-\infty}^{\infty} p_i \log p_i \tag{8.26} \\
&= -\sum_{-\infty}^{\infty} f(x_i)\Delta \log(f(x_i)\Delta) \tag{8.27} \\
&= -\sum \Delta f(x_i) \log f(x_i) - \sum f(x_i)\Delta \log \Delta \tag{8.28} \\
&= -\sum \Delta f(x_i) \log f(x_i) - \log \Delta, \tag{8.29}
\end{align}

since $\sum f(x_i)\Delta = \int f(x) = 1$. If $f(x) \log f(x)$ is Riemann integrable (a
condition to ensure that the limit is well defined [556]), the first term in
(8.29) approaches the integral of $-f(x) \log f(x)$ as $\Delta \to 0$ by definition
of Riemann integrability. This proves the following.

Theorem 8.3.1 If the density $f(x)$ of the random variable X is Riemann
integrable, then

$$H(X^\Delta) + \log \Delta \to h(f) = h(X), \quad \text{as } \Delta \to 0. \tag{8.30}$$

Thus, the entropy of an n-bit quantization of a continuous random variable
X is approximately $h(X) + n$.

Example 8.3.1

1. If X has a uniform distribution on [0, 1] and we let $\Delta = 2^{-n}$,
then $h = 0$, $H(X^\Delta) = n$, and n bits suffice to describe X to n-bit
accuracy.

2. If X is uniformly distributed on $[0, \frac{1}{8}]$, the first 3 bits to the right
of the decimal point must be 0. To describe X to n-bit accuracy
requires only $n - 3$ bits, which agrees with $h(X) = -3$.

3. If $X \sim \mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 100$, describing X to n-bit accuracy
would require on the average $n + \frac{1}{2} \log(2\pi e \sigma^2) = n + 5.37$ bits.

In general, $h(X) + n$ is the number of bits on the average required to
describe X to n-bit accuracy.

The differential entropy of a discrete random variable can be considered
to be $-\infty$. Note that $2^{-\infty} = 0$, agreeing with the idea that the volume of
the support set of a discrete random variable is zero.
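
The relation between quantized entropy and differential entropy can be checked directly for the Gaussian case with $\sigma^2 = 100$. This is a rough sketch under stated assumptions: the function names are illustrative, the bins are truncated at ten standard deviations on each side, and bin probabilities are computed from the normal distribution function via `math.erf`.

```python
import math

def normal_cdf(x, sigma):
    """Distribution function of N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def quantized_entropy_bits(sigma, n_bits, span_sigmas=10.0):
    """Entropy H(X^Delta) of N(0, sigma^2) quantized into bins of width Delta = 2^-n_bits."""
    delta = 2.0 ** (-n_bits)
    n_bins = int(math.ceil(span_sigmas * sigma / delta))  # bins on each side of zero
    entropy = 0.0
    for i in range(-n_bins, n_bins):
        p = normal_cdf((i + 1) * delta, sigma) - normal_cdf(i * delta, sigma)
        if p > 0.0:  # skip bins whose probability underflows to zero
            entropy -= p * math.log2(p)
    return entropy

sigma = 10.0  # sigma^2 = 100
h = 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)  # about 5.37 bits
results = {n: quantized_entropy_bits(sigma, n) for n in (1, 2, 3, 4)}
for n, value in results.items():
    print(n, value, h + n)
```

Each printed pair should nearly agree, illustrating that an n-bit quantization costs about $h(X) + n$ bits.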


8.4 JOINT AND CONDITIONAL DIFFERENTIAL ENTROPY 

As in the discrete case, we can extend the definition of differential entropy 
of a single random variable to several random variables. 

Definition The differential entropy of a set $X_1, X_2, \ldots, X_n$ of random
variables with density $f(x_1, x_2, \ldots, x_n)$ is defined as

$$h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n)\, dx^n. \tag{8.31}$$

Definition If X, Y have a joint density function $f(x, y)$, we can define
the conditional differential entropy $h(X|Y)$ as

$$h(X|Y) = -\int f(x, y) \log f(x|y)\, dx\, dy. \tag{8.32}$$

Since in general $f(x|y) = f(x, y)/f(y)$, we can also write

$$h(X|Y) = h(X, Y) - h(Y). \tag{8.33}$$

But we must be careful if any of the differential entropies are infinite.

The next entropy evaluation is used frequently in the text. 

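Theorem 8.4.1 (Entropy of a multivariate normal distribution) Let
$X_1, X_2, \ldots, X_n$ have a multivariate normal distribution with mean $\mu$ and
covariance matrix K. Then

$$h(X_1, X_2, \ldots, X_n) = h(\mathcal{N}_n(\mu, K)) = \frac{1}{2} \log (2\pi e)^n |K| \quad \text{bits,} \tag{8.34}$$

where $|K|$ denotes the determinant of K.

Proof: The probability density function of $X_1, X_2, \ldots, X_n$ is

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n |K|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu)^T K^{-1} (\mathbf{x} - \mu)}. \tag{8.35}$$

Then

\begin{align}
h(f) &= -\int f(\mathbf{x}) \left[ -\frac{1}{2} (\mathbf{x} - \mu)^T K^{-1} (\mathbf{x} - \mu) - \ln (\sqrt{2\pi})^n |K|^{1/2} \right] d\mathbf{x} \tag{8.36} \\
&= \frac{1}{2} E\left[ \sum_{i,j} (X_i - \mu_i) (K^{-1})_{ij} (X_j - \mu_j) \right] + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.37} \\
&= \frac{1}{2} E\left[ \sum_{i,j} (X_i - \mu_i)(X_j - \mu_j) (K^{-1})_{ij} \right] + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.38} \\
&= \frac{1}{2} \sum_{i,j} E[(X_j - \mu_j)(X_i - \mu_i)] (K^{-1})_{ij} + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.39} \\
&= \frac{1}{2} \sum_j \sum_i K_{ji} (K^{-1})_{ij} + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.40} \\
&= \frac{1}{2} \sum_j (K K^{-1})_{jj} + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.41} \\
&= \frac{1}{2} \sum_j I_{jj} + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.42} \\
&= \frac{n}{2} + \frac{1}{2} \ln (2\pi)^n |K| \tag{8.43} \\
&= \frac{1}{2} \ln (2\pi e)^n |K| \quad \text{nats} \tag{8.44} \\
&= \frac{1}{2} \log (2\pi e)^n |K| \quad \text{bits.} \quad □ \tag{8.45}
\end{align}


8.5 RELATIVE ENTROPY AND MUTUAL INFORMATION

We now extend the definition of two familiar quantities, $D(f\|g)$ and
$I(X; Y)$, to probability densities.

Definition The relative entropy (or Kullback-Leibler distance) $D(f\|g)$
between two densities f and g is defined by

$$D(f\|g) = \int f \log \frac{f}{g}. \tag{8.46}$$

Note that $D(f\|g)$ is finite only if the support set of f is contained in
the support set of g. [Motivated by continuity, we set $0 \log \frac{0}{0} = 0$.]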


Definition The mutual information $I(X; Y)$ between two random variables
with joint density $f(x, y)$ is defined as

$$I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy. \tag{8.47}$$

From the definition it is clear that

$$I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X, Y) \tag{8.48}$$

and

$$I(X; Y) = D(f(x, y) \| f(x) f(y)). \tag{8.49}$$

The properties of $D(f\|g)$ and $I(X; Y)$ are the same as in the discrete
case. In particular, the mutual information between two random
variables is the limit of the mutual information between their quantized
versions, since

\begin{align}
I(X^\Delta; Y^\Delta) &= H(X^\Delta) - H(X^\Delta | Y^\Delta) \tag{8.50} \\
&\approx h(X) - \log \Delta - (h(X|Y) - \log \Delta) \tag{8.51} \\
&= I(X; Y). \tag{8.52}
\end{align}

More generally, we can define mutual information in terms of finite
partitions of the range of the random variable. Let $\mathcal{X}$ be the range of a
random variable X. A partition $\mathcal{P}$ of $\mathcal{X}$ is a finite collection of disjoint
sets $P_i$ such that $\cup_i P_i = \mathcal{X}$. The quantization of X by $\mathcal{P}$ (denoted $[X]_{\mathcal{P}}$)
is the discrete random variable defined by

$$\Pr([X]_{\mathcal{P}} = i) = \Pr(X \in P_i) = \int_{P_i} dF(x). \tag{8.53}$$

For two random variables X and Y with partitions $\mathcal{P}$ and $\mathcal{Q}$, we can
calculate the mutual information between the quantized versions of X
and Y using (2.28). Mutual information can now be defined for arbitrary
pairs of random variables as follows:

Definition The mutual information between two random variables X
and Y is given by

$$I(X; Y) = \sup_{\mathcal{P}, \mathcal{Q}} I([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}), \tag{8.54}$$

where the supremum is over all finite partitions $\mathcal{P}$ and $\mathcal{Q}$.

This is the master definition of mutual information that always applies,
even to joint distributions with atoms, densities, and singular parts.
Moreover, by continuing to refine the partitions $\mathcal{P}$ and $\mathcal{Q}$, one finds a
monotonically increasing sequence $I([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}) \nearrow I$.

By arguments similar to (8.52), we can show that this definition of
mutual information is equivalent to (8.47) for random variables that have
a density. For discrete random variables, this definition is equivalent to
the definition of mutual information in (2.28).


Example 8.5.1 (Mutual information between correlated Gaussian random
variables with correlation $\rho$) Let $(X, Y) \sim \mathcal{N}(0, K)$, where

$$K = \begin{bmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{bmatrix}. \tag{8.55}$$

Then $h(X) = h(Y) = \frac{1}{2} \log (2\pi e)\sigma^2$ and $h(X, Y) = \frac{1}{2} \log (2\pi e)^2 |K| =
\frac{1}{2} \log (2\pi e)^2 \sigma^4 (1 - \rho^2)$, and therefore

$$I(X; Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2} \log (1 - \rho^2). \tag{8.56}$$

If $\rho = 0$, X and Y are independent and the mutual information is 0.
If $\rho = \pm 1$, X and Y are perfectly correlated and the mutual information
is infinite.
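
The closed form for the Gaussian mutual information can be cross-checked against the entropy decomposition $h(X) + h(Y) - h(X, Y)$. The short sketch below is illustrative; the function names are not from the text, and the result is reported in bits.

```python
import math

def gaussian_mi_bits(rho):
    """I(X;Y) = -1/2 log2(1 - rho^2) for jointly Gaussian X, Y with correlation rho."""
    if abs(rho) >= 1.0:
        return math.inf  # perfectly correlated: infinite mutual information
    return -0.5 * math.log2(1.0 - rho * rho)

def gaussian_mi_from_entropies(rho, sigma2=1.0):
    """The same quantity computed as h(X) + h(Y) - h(X, Y)."""
    two_pi_e = 2.0 * math.pi * math.e
    det_k = sigma2 * sigma2 * (1.0 - rho * rho)  # |K| = sigma^4 (1 - rho^2)
    h_x = 0.5 * math.log2(two_pi_e * sigma2)     # h(X) = h(Y)
    h_xy = 0.5 * math.log2(two_pi_e ** 2 * det_k)
    return 2.0 * h_x - h_xy

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, gaussian_mi_bits(rho), gaussian_mi_from_entropies(rho, sigma2=3.0))
```

The two columns agree for any variance, reflecting that the mutual information depends only on the correlation coefficient.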


8.6 PROPERTIES OF DIFFERENTIAL ENTROPY, RELATIVE 
ENTROPY, AND MUTUAL INFORMATION 

Theorem 8.6.1

$$D(f\|g) \ge 0 \tag{8.57}$$

with equality iff $f = g$ almost everywhere (a.e.).

Proof: Let S be the support set of f. Then

\begin{align}
-D(f\|g) &= \int_S f \log \frac{g}{f} \tag{8.58} \\
&\le \log \int_S f\, \frac{g}{f} \quad \text{(by Jensen's inequality)} \tag{8.59} \\
&= \log \int_S g \tag{8.60} \\
&\le \log 1 = 0. \tag{8.61}
\end{align}

We have equality iff we have equality in Jensen's inequality, which
occurs iff $f = g$ a.e. □

Corollary $I(X; Y) \ge 0$ with equality iff X and Y are independent.

Corollary $h(X|Y) \le h(X)$ with equality iff X and Y are independent.

Theorem 8.6.2 (Chain rule for differential entropy)

$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n h(X_i | X_1, X_2, \ldots, X_{i-1}). \tag{8.62}$$

Proof: Follows directly from the definitions. □

Corollary

$$h(X_1, X_2, \ldots, X_n) \le \sum h(X_i), \tag{8.63}$$

with equality iff $X_1, X_2, \ldots, X_n$ are independent.

Proof: Follows directly from Theorem 8.6.2 and the corollary to
Theorem 8.6.1. □


Application (Hadamard's inequality) If we let $X \sim \mathcal{N}(0, K)$ be a
multivariate normal random variable, calculating the entropy in the above
inequality gives us

$$|K| \le \prod_{i=1}^n K_{ii}, \tag{8.64}$$

which is Hadamard's inequality. A number of determinant inequalities
can be derived in this fashion from information-theoretic inequalities
(Chapter 17).
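
Hadamard's inequality is easy to test numerically on a randomly generated covariance matrix. The sketch below is a hedged illustration, not part of the original text: `determinant` and `random_covariance` are hypothetical helpers, and the matrix is built as $AA^T + I$ so that it is symmetric and positive definite.

```python
import random

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def random_covariance(n, rng):
    """Return A A^T + I for a random Gaussian matrix A."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(a[i][k] * a[j][k] for k in range(n)) + (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

rng = random.Random(1)
K = random_covariance(4, rng)
det_K = determinant(K)
diag_product = 1.0
for i in range(4):
    diag_product *= K[i][i]
print(det_K, diag_product)  # the determinant should not exceed the diagonal product
```

Equality would require K to be diagonal, matching the condition that the components of the Gaussian vector are independent.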


Theorem 8.6.3

$$h(X + c) = h(X). \tag{8.65}$$

Translation does not change the differential entropy.

Proof: Follows directly from the definition of differential entropy. □

Theorem 8.6.4

$$h(aX) = h(X) + \log |a|. \tag{8.66}$$

Proof: Let $Y = aX$. Then $f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y}{a}\right)$, and

\begin{align}
h(aX) &= -\int f_Y(y) \log f_Y(y)\, dy \tag{8.67} \\
&= -\int \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \log \left( \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \right) dy \tag{8.68} \\
&= -\int f_X(x) \log f_X(x)\, dx + \log |a| \tag{8.69} \\
&= h(X) + \log |a|, \tag{8.70}
\end{align}

after a change of variables in the integral. □


Similarly, we can prove the following corollary for vector-valued random
variables.

Corollary

$$h(AX) = h(X) + \log |\det(A)|. \tag{8.71}$$


We now show that the multivariate normal distribution maximizes the 
entropy over all distributions with the same covariance. 

Theorem 8.6.5 Let the random vector $X \in \mathcal{R}^n$ have zero mean and
covariance $K = EXX^t$ (i.e., $K_{ij} = EX_iX_j$, $1 \le i, j \le n$). Then $h(X) \le
\frac{1}{2} \log (2\pi e)^n |K|$, with equality iff $X \sim \mathcal{N}(0, K)$.

Proof: Let $g(\mathbf{x})$ be any density satisfying $\int g(\mathbf{x}) x_i x_j\, d\mathbf{x} = K_{ij}$ for all
i, j. Let $\phi_K$ be the density of a $\mathcal{N}(0, K)$ vector as given in (8.35), where we
set $\mu = 0$. Note that $\log \phi_K(\mathbf{x})$ is a quadratic form and $\int x_i x_j \phi_K(\mathbf{x})\, d\mathbf{x} =
K_{ij}$. Then

\begin{align}
0 &\le D(g \| \phi_K) \tag{8.72} \\
&= \int g \log (g / \phi_K) \tag{8.73} \\
&= -h(g) - \int g \log \phi_K \tag{8.74} \\
&= -h(g) - \int \phi_K \log \phi_K \tag{8.75} \\
&= -h(g) + h(\phi_K), \tag{8.76}
\end{align}

where the substitution $\int g \log \phi_K = \int \phi_K \log \phi_K$ follows from the fact
that g and $\phi_K$ yield the same moments of the quadratic form $\log \phi_K(\mathbf{x})$.

□

In particular, the Gaussian distribution maximizes the entropy over
all distributions with the same variance. This leads to the estimation
counterpart to Fano's inequality. Let X be a random variable with
differential entropy $h(X)$. Let $\hat{X}$ be an estimate of X, and let $E(X - \hat{X})^2$ be
the expected prediction error. Let $h(X)$ be in nats.

Theorem 8.6.6 (Estimation error and differential entropy) For any
random variable X and estimator $\hat{X}$,

$$E(X - \hat{X})^2 \ge \frac{1}{2\pi e} e^{2h(X)},$$

with equality if and only if X is Gaussian and $\hat{X}$ is the mean of X.

Proof: Let $\hat{X}$ be any estimator of X; then

\begin{align}
E(X - \hat{X})^2 &\ge \min_{\hat{X}} E(X - \hat{X})^2 \tag{8.77} \\
&= E(X - E(X))^2 \tag{8.78} \\
&= \mathrm{var}(X) \tag{8.79} \\
&\ge \frac{1}{2\pi e} e^{2h(X)}, \tag{8.80}
\end{align}

where (8.78) follows from the fact that the mean of X is the best estimator
for X and the last inequality follows from the fact that the Gaussian
distribution has the maximum entropy for a given variance. We have
equality in (8.78) only if $\hat{X}$ is the best estimator (i.e., $\hat{X}$ is the mean
of X) and equality in (8.80) only if X is Gaussian. □
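
Two closed-form cases illustrate the estimation bound. The sketch below is illustrative: `mse_lower_bound` is a hypothetical helper name, and the entropies used are $\ln a$ nats for a uniform variable on $[0, a]$ (Example 8.1.1) and $\frac{1}{2} \ln 2\pi e \sigma^2$ nats for a Gaussian (Example 8.1.2).

```python
import math

def mse_lower_bound(h_nats):
    """Lower bound e^{2h} / (2 pi e) on E(X - Xhat)^2, with h in nats."""
    return math.exp(2.0 * h_nats) / (2.0 * math.pi * math.e)

# Uniform on [0, a]: the best estimator is the mean, with error var(X) = a^2 / 12.
a = 3.0
uniform_var = a * a / 12.0
uniform_bound = mse_lower_bound(math.log(a))

# Gaussian with variance sigma2: the bound is met with equality.
sigma2 = 4.0
gauss_bound = mse_lower_bound(0.5 * math.log(2.0 * math.pi * math.e * sigma2))

print(uniform_var, uniform_bound)  # variance strictly exceeds the bound
print(sigma2, gauss_bound)         # variance equals the bound
```

The gap in the uniform case shows that the bound is tight only for Gaussian X.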

Corollary Given side information Y and estimator $\hat{X}(Y)$, it follows that

$$E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2h(X|Y)}.$$


SUMMARY

$$h(X) = h(f) = -\int_S f(x) \log f(x)\,dx \tag{8.81}$$

$$f(X^n) \doteq 2^{-nh(X)} \tag{8.82}$$

$$\mathrm{Vol}(A_\epsilon^{(n)}) \doteq 2^{nh(X)}. \tag{8.83}$$

$$H([X]_{2^{-n}}) \approx h(X) + n. \tag{8.84}$$

$$h(\mathcal{N}(0, \sigma^2)) = \frac{1}{2} \log 2\pi e \sigma^2. \tag{8.85}$$

$$h(\mathcal{N}_n(\mu, K)) = \frac{1}{2} \log (2\pi e)^n |K|. \tag{8.86}$$

$$D(f\|g) = \int f \log \frac{f}{g} \ge 0. \tag{8.87}$$

$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n h(X_i | X_1, X_2, \ldots, X_{i-1}). \tag{8.88}$$

$$h(X|Y) \le h(X). \tag{8.89}$$

$$h(aX) = h(X) + \log |a|. \tag{8.90}$$

$$I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \ge 0. \tag{8.91}$$

$$\max_{EXX^t = K} h(X) = \frac{1}{2} \log (2\pi e)^n |K|. \tag{8.92}$$

$$E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2h(X|Y)}.$$

$2^{nH(X)}$ is the effective alphabet size for a discrete random variable.
$2^{nh(X)}$ is the effective support set size for a continuous random variable.

$2^C$ is the effective alphabet size of a channel of capacity C.



PROBLEMS 

8.1 Differential entropy. Evaluate the differential entropy $h(X) =
-\int f \ln f$ for the following:

(a) The exponential density, $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$.

(b) The Laplace density, $f(x) = \frac{1}{2} \lambda e^{-\lambda |x|}$.

(c) The sum of $X_1$ and $X_2$, where $X_1$ and $X_2$ are independent
normal random variables with means $\mu_i$ and variances $\sigma_i^2$, $i = 1, 2$.

8.2 Concavity of determinants. Let $K_1$ and $K_2$ be two symmetric
nonnegative definite $n \times n$ matrices. Prove the result of Ky Fan [199]:

$$|\lambda K_1 + \bar{\lambda} K_2| \ge |K_1|^{\lambda} |K_2|^{\bar{\lambda}} \quad \text{for } 0 \le \lambda \le 1, \ \bar{\lambda} = 1 - \lambda,$$

where $|K|$ denotes the determinant of K. [Hint: Let $Z = X_\theta$,
where $X_1 \sim N(0, K_1)$, $X_2 \sim N(0, K_2)$ and $\theta$ = Bernoulli($\lambda$). Then
use $h(Z | \theta) \le h(Z)$.]

8.3 Uniformly distributed noise. Let the input random variable X to
a channel be uniformly distributed over the interval $-\frac{1}{2} \le x \le +\frac{1}{2}$.
Let the output of the channel be $Y = X + Z$, where the noise random
variable is uniformly distributed over the interval $-a/2 \le z \le +a/2$.

(a) Find $I(X; Y)$ as a function of a.

(b) For $a = 1$ find the capacity of the channel when the input X
is peak-limited; that is, the range of X is limited to $-\frac{1}{2} \le x \le
+\frac{1}{2}$. What probability distribution on X maximizes the mutual
information $I(X; Y)$?

(c) (Optional) Find the capacity of the channel for all values of a,
again assuming that the range of X is limited to $-\frac{1}{2} \le x \le +\frac{1}{2}$.

8.4 Quantized random variables. Roughly how many bits are required 
on the average to describe to three-digit accuracy the decay time 
(in years) of a radium atom if the half-life of radium is 80 years? 
Note that half-life is the median of the distribution. 

8.5 Scaling. Let $h(X) = -\int f(x) \log f(x)\, dx$. Show
$h(AX) = \log |\det(A)| + h(X)$.

8.6 Variational inequality. Verify for positive random variables X
that

$$\log E_P(X) = \sup_Q \left[ E_Q(\log X) - D(Q\|P) \right], \tag{8.93}$$

where $E_P(X) = \sum_x x P(x)$ and $D(Q\|P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$
and the supremum is over all $Q(x) \ge 0$, $\sum Q(x) = 1$. It is enough
to extremize $J(Q) = E_Q \ln X - D(Q\|P) + \lambda \left( \sum Q(x) - 1 \right)$.

8.7 Differential entropy bound on discrete entropy. Let X be a discrete
random variable on the set $\mathcal{X} = \{a_1, a_2, \ldots\}$ with $\Pr(X =
a_i) = p_i$. Show that

$$H(p_1, p_2, \ldots) \le \frac{1}{2} \log (2\pi e) \left( \sum_{i=1}^{\infty} p_i i^2 - \left( \sum_{i=1}^{\infty} i p_i \right)^2 + \frac{1}{12} \right). \tag{8.94}$$

Moreover, for every permutation $\sigma$,

$$H(p_1, p_2, \ldots) \le \frac{1}{2} \log (2\pi e) \left( \sum_{i=1}^{\infty} p_{\sigma(i)} i^2 - \left( \sum_{i=1}^{\infty} i p_{\sigma(i)} \right)^2 + \frac{1}{12} \right). \tag{8.95}$$

[Hint: Construct a random variable $X'$ such that $\Pr(X' = i) = p_i$.
Let U be a uniform (0,1] random variable and let $Y = X' + U$,
where $X'$ and U are independent. Use the maximum entropy bound
on Y to obtain the bounds in the problem. This bound is due to
Massey (unpublished) and Willems (unpublished).]

8.8 Channel with uniformly distributed noise. Consider an additive
channel whose input alphabet $\mathcal{X} = \{0, \pm 1, \pm 2\}$ and whose output
$Y = X + Z$, where Z is distributed uniformly over the interval
$[-1, 1]$. Thus, the input of the channel is a discrete random variable,
whereas the output is continuous. Calculate the capacity $C =
\max_{p(x)} I(X; Y)$ of this channel.

8.9 Gaussian mutual information. Suppose that (X, Y, Z) are jointly
Gaussian and that $X \to Y \to Z$ forms a Markov chain. Let X and
Y have correlation coefficient $\rho_1$ and let Y and Z have correlation
coefficient $\rho_2$. Find $I(X; Z)$.

8.10 Shape of the typical set. Let $X_i$ be i.i.d. $\sim f(x)$, where

$$f(x) = c e^{-x^4}.$$

Let $h = -\int f \ln f$. Describe the shape (or form) of the typical set
$A_\epsilon^{(n)} = \{x^n \in \mathcal{R}^n : f(x^n) \in 2^{-n(h \pm \epsilon)}\}$.

8.11 Nonergodic Gaussian process. Consider a constant signal V in
the presence of i.i.d. observational noise $\{Z_i\}$. Thus, $X_i = V + Z_i$,
where $V \sim N(0, S)$ and $Z_i$ are i.i.d. $\sim N(0, N)$. Assume that V and
$\{Z_i\}$ are independent.

(a) Is $\{X_i\}$ stationary?

(b) Find $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i$. Is the limit random?

(c) What is the entropy rate h of $\{X_i\}$?

(d) Find the least-mean-squared error predictor $\hat{X}_{n+1}(X^n)$, and find
$\sigma_\infty^2 = \lim_{n \to \infty} E(\hat{X}_n - X_n)^2$.

(e) Does $\{X_i\}$ have an AEP? That is, does $-\frac{1}{n} \log f(X^n) \to h$?

HISTORICAL NOTES 

Differential entropy and discrete entropy were introduced in Shannon’s 
original paper [472]. The general rigorous definition of relative entropy 
and mutual information for arbitrary random variables was developed by 
Kolmogorov [319] and Pinsker [425], who defined mutual information as
$\sup_{\mathcal{P}, \mathcal{Q}} I([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}})$, where the supremum is over all finite partitions $\mathcal{P}$
and $\mathcal{Q}$.



CHAPTER 9 

GAUSSIAN CHANNEL 


The most important continuous alphabet channel is the Gaussian channel
depicted in Figure 9.1. This is a time-discrete channel with output $Y_i$ at
time i, where $Y_i$ is the sum of the input $X_i$ and the noise $Z_i$. The noise
$Z_i$ is drawn i.i.d. from a Gaussian distribution with variance N. Thus,

$$Y_i = X_i + Z_i, \qquad Z_i \sim \mathcal{N}(0, N). \tag{9.1}$$

The noise $Z_i$ is assumed to be independent of the signal $X_i$. This channel
is a model for some common communication channels, such as wired and 
wireless telephone channels and satellite links. Without further conditions, 
the capacity of this channel may be infinite. If the noise variance is zero, 
the receiver receives the transmitted symbol perfectly. Since X can take 
on any real value, the channel can transmit an arbitrary real number with 
no error. 

If the noise variance is nonzero and there is no constraint on the input, 
we can choose an infinite subset of inputs arbitrarily far apart, so that 
they are distinguishable at the output with arbitrarily small probability of 
error. Such a scheme has an infinite capacity as well. Thus if the noise 
variance is zero or the input is unconstrained, the capacity of the channel 
is infinite. 

The most common limitation on the input is an energy or power constraint.
We assume an average power constraint. For any codeword $(x_1, x_2, \ldots, x_n)$
transmitted over the channel, we require that

$$\frac{1}{n} \sum_{i=1}^n x_i^2 \le P. \tag{9.2}$$

This communication channel models many practical channels, including
radio and satellite links. The additive noise in such channels may be
due to a variety of causes. However, by the central limit theorem, the
cumulative effect of a large number of small random effects will be
approximately normal, so the Gaussian assumption is valid in a large
number of situations.

[Figure 9.1: Gaussian channel, with noise $Z_i$ added to the input $X_i$ to give the output $Y_i$.]

We first analyze a simple suboptimal way to use this channel. Assume
that we want to send 1 bit over the channel in one use of the channel.
Given the power constraint, the best that we can do is to send one of
two levels, $+\sqrt{P}$ or $-\sqrt{P}$. The receiver looks at the corresponding Y
received and tries to decide which of the two levels was sent. Assuming
that both levels are equally likely (this would be the case if we wish to
send exactly 1 bit of information), the optimum decoding rule is to decide
that $+\sqrt{P}$ was sent if $Y > 0$ and decide $-\sqrt{P}$ was sent if $Y < 0$. The
probability of error with such a decoding scheme is

\begin{align}
P_e &= \frac{1}{2} \Pr(Y < 0 | X = +\sqrt{P}) + \frac{1}{2} \Pr(Y > 0 | X = -\sqrt{P}) \tag{9.3} \\
&= \frac{1}{2} \Pr(Z < -\sqrt{P} | X = +\sqrt{P}) + \frac{1}{2} \Pr(Z > \sqrt{P} | X = -\sqrt{P}) \tag{9.4} \\
&= \Pr(Z > \sqrt{P}) \tag{9.5} \\
&= 1 - \Phi\left(\sqrt{P/N}\right), \tag{9.6}
\end{align}

where $\Phi(x)$ is the cumulative normal function

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}\, dt. \tag{9.7}$$

Using such a scheme, we have converted the Gaussian channel into a
discrete binary symmetric channel with crossover probability $P_e$. Similarly,
by using a four-level input signal, we can convert the Gaussian channel
into a discrete four-input channel. In some practical modulation schemes,
similar ideas are used to convert the continuous channel into a discrete
channel. The main advantage of a discrete channel is ease of processing
of the output signal for error correction, but some information is lost in
the quantization.
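
The crossover probability of the two-level scheme, and the capacity of the binary symmetric channel it induces, can be evaluated directly. This is a minimal sketch with illustrative function names; it uses the identity $1 - \Phi(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$ and the BSC capacity $1 - H(p)$ from Chapter 7.

```python
import math

def hard_decision_error_probability(power, noise_var):
    """P_e = 1 - Phi(sqrt(P/N)), written with the complementary error function."""
    return 0.5 * math.erfc(math.sqrt(power / noise_var) / math.sqrt(2.0))

def bsc_capacity_bits(p):
    """Capacity 1 - H(p) of a binary symmetric channel with crossover probability p."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)

for snr in (0.1, 1.0, 10.0):
    p_e = hard_decision_error_probability(snr, 1.0)
    print(snr, p_e, bsc_capacity_bits(p_e))
```

At $P = N$ the crossover probability is about 0.159, so the induced binary channel carries well under 1 bit per use.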

9.1 GAUSSIAN CHANNEL: DEFINITIONS 

We now define the (information) capacity of the channel as the maxi¬ 
mum of the mutual information between the input and output over all 
distributions on the input that satisfy the power constraint. 

Definition The information capacity of the Gaussian channel with 
power constraint P is 

C = max /(X; Y). (9.8) 

f(x):EX 2 <P 

We can calculate the information capacity as follows: Expanding 
/(X; F), we have 

I{X\Y) = h{Y)-h{Y\X) (9.9) 

=h(Y) - h(X + Z\X) (9.10) 

=h(Y) - h(Z\X) (9.11) 

= h(Y)-h(Z), (9.12) 

since Z is independent of X. Now, h(Z) = ^ log 2neN. Also, 

EY 2 = E(X + Z) 2 = EX 2 + 2EXEZ + EZ 2 = P + N, (9.13) 

since X and Z are independent and EZ = 0. Given EY 2 = P + N, the 
entropy of Y is bounded by ^log2ne(P + N) by Theorem 8.6.5 (the 
normal maximizes the entropy for a given variance). 

Applying this result to bound the mutual information, we obtain 

I(X]Y) = h(Y)-h(Z) (9.14) 

1 1 

< - log 2ne(P + N) — - log 2neN (9.15) 

log (1 + |). (9.16) 
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and the maximum is attained when X 〜 A/(0, P). 

We will now show that this capacity is also the supremum of the rates 
achievable for the channel. The arguments are similar to the arguments 
for a discrete channel. We will begin with the corresponding definitions. 

Definition An (M, n) code for the Gaussian channel with power con¬ 
straint P consists of the following: 

1. An index set {1,2,..., M). 

2. An encoding function x : {1,2,..., M] X n , yielding codewords 
x n (l) 9 x n (2 ),..., x n (M), satisfying the power constraint P\ that is, 
for every codeword 

n 

w = l,2,...,M. (9.18) 

i=l 

3. A decoding function 

{1 ， 2, … ， M}. (9.19) 

The rate and probability of error of the code are defined as in Chapter 7 
for the discrete case. The arithmetic average of the probability of error is 
defined by 

= ( 9 _ 20 ) 

Definition A rate R is said to be achievable for a Gaussian channel 

with a power constraint P if there exists a sequence of (2 nR , n) codes 
with codewords satisfying the power constraint such that the maximal 
probability of error 入⑻ tends to zero. The capacity of the channel is the 
supremum of the achievable rates. 

Theorem 9.1.1 The capacity of a Gaussian channel with power con¬ 
straint P and noise variance N is 

1 ( P\ 

C = - log I 1 + — I bits per transmission. (9.21) 


Hence, the information capacity of the Gaussian channel is

$$C = \max_{E X^2 \le P} I(X; Y) = \frac{1}{2}\log\left(1 + \frac{P}{N}\right). \tag{9.17}$$

Remark We first present a plausibility argument as to why we may be
able to construct $(2^{nC}, n)$ codes with a low probability of error. Consider
any codeword of length $n$. The received vector is normally distributed with
mean equal to the true codeword and variance equal to the noise variance.
With high probability, the received vector is contained in a sphere of radius
$\sqrt{n(N + \epsilon)}$ around the true codeword. If we assign everything within this
sphere to the given codeword, when this codeword is sent there will be
an error only if the received vector falls outside the sphere, which has
low probability.

Similarly, we can choose other codewords and their corresponding
decoding spheres. How many such codewords can we choose? The volume
of an $n$-dimensional sphere is of the form $C_n r^n$, where $r$ is the
radius of the sphere. In this case, each decoding sphere has radius $\sqrt{nN}$.
These spheres are scattered throughout the space of received vectors. The
received vectors have energy no greater than $n(P + N)$, so they lie in a
sphere of radius $\sqrt{n(P + N)}$. The maximum number of nonintersecting
decoding spheres in this volume is no more than

$$\frac{C_n \left(n(P + N)\right)^{n/2}}{C_n (nN)^{n/2}} = 2^{\frac{n}{2}\log\left(1 + \frac{P}{N}\right)} \tag{9.22}$$

and the rate of the code is $\frac{1}{2}\log\left(1 + \frac{P}{N}\right)$. This idea is illustrated in Figure 9.2.

FIGURE 9.2. Sphere packing for the Gaussian channel.

This sphere-packing argument indicates that we cannot hope to send
at rates greater than $C$ with low probability of error. However, we can
actually do almost as well as this, as is proved next.
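The sphere-packing count can be checked numerically. The following is a minimal sketch, assuming illustrative values $P = 4$, $N = 1$, and $n = 10000$ (none of which come from the text); the function names are hypothetical. It confirms that the per-symbol logarithm of the volume ratio equals $\frac{1}{2}\log(1 + P/N)$, and that the per-coordinate energy of a noise vector concentrates near $N$, which is why a decoding sphere of radius $\sqrt{n(N + \epsilon)}$ captures the received vector with high probability.

```python
import math
import random

def capacity_bits(P, N):
    """Capacity 0.5 * log2(1 + P/N) in bits per transmission (Theorem 9.1.1)."""
    return 0.5 * math.log2(1.0 + P / N)

def log2_sphere_count(n, P, N):
    """log2 of (n(P+N))^(n/2) / (nN)^(n/2); the constant C_n cancels in the ratio."""
    return (n / 2.0) * (math.log2(n * (P + N)) - math.log2(n * N))

def mean_energy(n, variance, rng):
    """Per-coordinate energy of one n-vector with i.i.d. N(0, variance) entries."""
    sigma = math.sqrt(variance)
    return sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)) / n

# Illustrative values only (assumptions, not taken from the text).
P, N, n = 4.0, 1.0, 10000
rng = random.Random(0)

rate = log2_sphere_count(n, P, N) / n   # bits per transmission
noise_energy = mean_energy(n, N, rng)   # should be close to N for large n

print(f"rate from sphere count : {rate:.6f}")
print(f"capacity               : {capacity_bits(P, N):.6f}")
print(f"noise energy per symbol: {noise_energy:.4f}")
```

The count is only an upper bound on the number of nonintersecting spheres, so this reproduces the heuristic, not the proof.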

Proof: (Achievability). We will use the same ideas as in the proof of 
the channel coding theorem in the case of discrete channels: namely, 
random codes and joint typicality decoding. However, we must make 
some modifications to take into account the power constraint and the fact 
that the variables are continuous and not discrete. 


1. Generation of the codebook. We wish to generate a codebook in
which all the codewords satisfy the power constraint. To ensure
this, we generate the codewords with each element i.i.d. according
to a normal distribution with variance $P - \epsilon$. Since for large
$n$, $\frac{1}{n}\sum X_i^2 \to P - \epsilon$, the probability that a codeword does not
satisfy the power constraint will be small. Let $X_i(w)$, $i = 1, 2, \ldots, n$,
$w = 1, 2, \ldots, 2^{nR}$ be i.i.d. $\sim \mathcal{N}(0, P - \epsilon)$, forming codewords
$X^n(1), X^n(2), \ldots, X^n(2^{nR}) \in \mathcal{R}^n$.

2. Encoding. After the generation of the codebook, the codebook is
revealed to both the sender and the receiver. To send the message
index $w$, the transmitter sends the $w$th codeword $X^n(w)$ in the
codebook.

3. Decoding. The receiver looks down the list of codewords $\{X^n(w)\}$
and searches for one that is jointly typical with the received vector.
If there is one and only one such codeword $X^n(w)$, the receiver
declares $\hat{W} = w$ to be the transmitted codeword. Otherwise, the
receiver declares an error. The receiver also declares an error if the
chosen codeword does not satisfy the power constraint.

4. Probability of error. Without loss of generality, assume that
codeword 1 was sent. Thus, $Y^n = X^n(1) + Z^n$. Define the following
events:

$$E_0 = \left\{ \frac{1}{n}\sum_{j=1}^{n} X_j^2(1) > P \right\} \tag{9.23}$$

and

$$E_i = \left\{ (X^n(i), Y^n) \text{ is in } A_\epsilon^{(n)} \right\}. \tag{9.24}$$

Then an error occurs if $E_0$ occurs (the power constraint is violated)
or $E_1^c$ occurs (the transmitted codeword and the received sequence
are not jointly typical) or $E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}$ occurs (some wrong
codeword is jointly typical with the received sequence). Let $\mathcal{E}$ denote
the event $\hat{W} \ne W$ and let $P$ denote the conditional probability given
that $W = 1$. Hence,

$$\Pr(\mathcal{E} \mid W = 1) = P(\mathcal{E}) = P\left(E_0 \cup E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{nR}}\right) \tag{9.25}$$

$$\le P(E_0) + P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i), \tag{9.26}$$

by the union of events bound for probabilities. By the law of large
numbers, $P(E_0) \to 0$ as $n \to \infty$. Now, by the joint AEP (which
can be proved using the same argument as that used in the discrete
case), $P(E_1^c) \to 0$, and hence

$$P(E_1^c) \le \epsilon \quad \text{for } n \text{ sufficiently large.} \tag{9.27}$$

Since by the code generation process, $X^n(1)$ and $X^n(i)$ are independent,
so are $Y^n$ and $X^n(i)$. Hence, the probability that $X^n(i)$ and $Y^n$
will be jointly typical is $\le 2^{-n(I(X;Y) - 3\epsilon)}$ by the joint AEP.

Now let $W$ be uniformly distributed over $\{1, 2, \ldots, 2^{nR}\}$, and
consequently,

$$\Pr(\mathcal{E}) = \frac{1}{2^{nR}} \sum \lambda_i = P_e^{(n)}. \tag{9.28}$$

Then

$$P_e^{(n)} = \Pr(\mathcal{E}) = \Pr(\mathcal{E} \mid W = 1) \tag{9.29}$$

$$\le P(E_0) + P(E_1^c) + \sum_{i=2}^{2^{nR}} P(E_i) \tag{9.30}$$

$$\le \epsilon + \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y) - 3\epsilon)} \tag{9.31}$$

$$= 2\epsilon + \left(2^{nR} - 1\right) 2^{-n(I(X;Y) - 3\epsilon)} \tag{9.32}$$

$$\le 2\epsilon + 2^{3n\epsilon}\, 2^{-n(I(X;Y) - R)} \tag{9.33}$$

$$\le 3\epsilon \tag{9.34}$$

for $n$ sufficiently large and $R < I(X; Y) - 3\epsilon$. This proves the
existence of a good $(2^{nR}, n)$ code.

Now choosing a good codebook and deleting the worst half of the 
codewords, we obtain a code with low maximal probability of error. In 
particular, the power constraint is satisfied by each of the remaining code¬ 
words (since the codewords that do not satisfy the power constraint have 
probability of error 1 and must belong to the worst half of the codewords). 
Hence we have constructed a code that achieves a rate arbitrarily close to 
capacity. The forward part of the theorem is proved. In the next section 
we show that the achievable rate cannot exceed the capacity. □ 
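The random-coding construction can be tried at a toy block length. The sketch below is an illustration under stated assumptions, not the decoder of the proof: it draws the codebook i.i.d. $\mathcal{N}(0, P - \epsilon)$ as in step 1, but it substitutes minimum Euclidean distance decoding for joint typicality decoding and ignores the power-violation event $E_0$. The function name and the values $n = 16$, $R = 0.5$, $P = 1$, $N = 0.1$, $\epsilon = 0.05$ are hypothetical choices with $R$ well below $C$.

```python
import math
import random

def simulate_random_code(n, rate, P, N, eps, trials, seed=0):
    """Estimate the block error rate of a random Gaussian codebook.

    Decoding is minimum Euclidean distance, used here as a stand-in for
    the joint typicality decoder of the proof (an assumption of this sketch).
    """
    rng = random.Random(seed)
    num_codewords = 2 ** round(n * rate)
    sigma_x = math.sqrt(P - eps)   # codeword entries ~ N(0, P - eps)
    sigma_z = math.sqrt(N)         # noise entries ~ N(0, N)
    codebook = [[rng.gauss(0.0, sigma_x) for _ in range(n)]
                for _ in range(num_codewords)]
    errors = 0
    for _ in range(trials):
        w = rng.randrange(num_codewords)                         # message index
        y = [x + rng.gauss(0.0, sigma_z) for x in codebook[w]]   # Y^n = X^n(w) + Z^n
        w_hat = min(range(num_codewords),
                    key=lambda m: sum((a - b) ** 2
                                      for a, b in zip(codebook[m], y)))
        if w_hat != w:
            errors += 1
    return errors / trials

# Toy parameters (assumptions): C = 0.5 * log2(1 + 10) is about 1.73 bits.
error_rate = simulate_random_code(n=16, rate=0.5, P=1.0, N=0.1,
                                  eps=0.05, trials=200)
print(f"estimated block error rate: {error_rate:.3f}")
```

At such a short block length many codewords violate the power constraint, which is exactly why the proof needs large $n$ and the expurgation step described above.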

9.2 CONVERSE TO THE CODING THEOREM FOR GAUSSIAN 
CHANNELS 

In this section we complete the proof that the capacity of a Gaussian
channel is $C = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$ by proving that rates $R > C$ are not
achievable. The proof parallels the proof for the discrete channel. The main new
ingredient is the power constraint.

Proof: (Converse to Theorem 9.1.1). We must show that if $P_e^{(n)} \to 0$ for
a sequence of $(2^{nR}, n)$ codes for a Gaussian channel with power constraint
$P$, then

$$R \le C = \frac{1}{2}\log\left(1 + \frac{P}{N}\right). \tag{9.35}$$

Consider any $(2^{nR}, n)$ code that satisfies the power constraint, that is,

$$\frac{1}{n}\sum_{i=1}^{n} x_i^2(w) \le P, \tag{9.36}$$

for $w = 1, 2, \ldots, 2^{nR}$. Proceeding as in the converse for the discrete case,
let $W$ be distributed uniformly over $\{1, 2, \ldots, 2^{nR}\}$. The uniform
distribution over the index set $W \in \{1, 2, \ldots, 2^{nR}\}$ induces a distribution on
the input codewords, which in turn induces a distribution over the input
alphabet. This specifies a joint distribution on $W \to X^n(W) \to Y^n \to \hat{W}$.
To relate probability of error and mutual information, we can apply Fano's
inequality to obtain

$$H(W \mid \hat{W}) \le 1 + nR P_e^{(n)} = n\epsilon_n, \tag{9.37}$$

where $\epsilon_n \to 0$ as $P_e^{(n)} \to 0$. Hence,

$$nR = H(W) = I(W; \hat{W}) + H(W \mid \hat{W}) \tag{9.38}$$

$$\le I(W; \hat{W}) + n\epsilon_n \tag{9.39}$$

$$\le I(X^n; Y^n) + n\epsilon_n \tag{9.40}$$

$$= h(Y^n) - h(Y^n \mid X^n) + n\epsilon_n \tag{9.41}$$

$$= h(Y^n) - h(Z^n) + n\epsilon_n \tag{9.42}$$

$$\le \sum_{i=1}^{n} h(Y_i) - h(Z^n) + n\epsilon_n \tag{9.43}$$

$$= \sum_{i=1}^{n} h(Y_i) - \sum_{i=1}^{n} h(Z_i) + n\epsilon_n \tag{9.44}$$

$$= \sum_{i=1}^{n} I(X_i; Y_i) + n\epsilon_n. \tag{9.45}$$

i=l 

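Here $X_i = x_i(W)$, where $W$ is drawn according to the uniform distribution
on $\{1, 2, \ldots, 2^{nR}\}$. Now let $P_i$ be the average power of the $i$th column of
the codebook, that is,

$$P_i = \frac{1}{2^{nR}} \sum_{w} x_i^2(w). \tag{9.46}$$

Then, since $Y_i = X_i + Z_i$ and since $X_i$ and $Z_i$ are independent, the
average power $E Y_i^2$ of $Y_i$ is $P_i + N$. Hence, since entropy is maximized by
the normal distribution,

$$h(Y_i) \le \frac{1}{2}\log 2\pi e(P_i + N). \tag{9.47}$$

Continuing with the inequalities of the converse, we obtain

$$nR \le \sum \left(h(Y_i) - h(Z_i)\right) + n\epsilon_n \tag{9.48}$$

$$\le \sum \left(\frac{1}{2}\log\left(2\pi e(P_i + N)\right) - \frac{1}{2}\log 2\pi e N\right) + n\epsilon_n \tag{9.49}$$

$$= \sum \frac{1}{2}\log\left(1 + \frac{P_i}{N}\right) + n\epsilon_n. \tag{9.50}$$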




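Since each of the codewords satisfies the power constraint, so does their
average, and hence

$$\frac{1}{n}\sum_{i} P_i \le P. \tag{9.51}$$

Since $f(x) = \frac{1}{2}\log(1 + x)$ is a concave function of $x$, we can apply
Jensen's inequality to obtain

$$\frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\log\left(1 + \frac{P_i}{N}\right) \le \frac{1}{2}\log\left(1 + \frac{1}{n}\sum_{i=1}^{n} \frac{P_i}{N}\right) \tag{9.52}$$

$$\le \frac{1}{2}\log\left(1 + \frac{P}{N}\right). \tag{9.53}$$

Thus $R \le \frac{1}{2}\log\left(1 + \frac{P}{N}\right) + \epsilon_n$, $\epsilon_n \to 0$, and we have the required converse.
Note that the power constraint enters the standard proof in (9.46).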
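The Jensen step says that spreading power unevenly across the columns of the codebook can only lower the bound. A small numerical sketch, assuming hypothetical column powers $P_i$ whose average is $P = 2$ and noise variance $N = 1$ (these values and the helper name are illustrative, not from the text):

```python
import math

def half_log(x):
    """f(x) = 0.5 * log2(1 + x), the concave function used in the Jensen step."""
    return 0.5 * math.log2(1.0 + x)

N = 1.0                                 # noise variance (assumed value)
column_powers = [0.5, 3.0, 1.0, 3.5]    # hypothetical P_i, one per column
P = sum(column_powers) / len(column_powers)   # average power, here 2.0

# Left side of (9.52): average of f(P_i / N) over the columns.
lhs = sum(half_log(p / N) for p in column_powers) / len(column_powers)
# Right side: f applied to the average power, i.e. the capacity expression.
rhs = half_log(P / N)

print(f"average of per-column terms: {lhs:.4f}")
print(f"0.5*log2(1 + P/N)          : {rhs:.4f}")
```

Equality would hold only if every column had the same power $P_i = P$.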


9.3 BANDLIMITED CHANNELS 

A common model for communication over a radio network or a telephone 
line is a bandlimited channel with white noise. This is a continuous¬ 
time channel. The output of such a channel can be described as the 
convolution 


Y(t) = (X(t) + Z(t)) * h(t), (9.54) 

where $X(t)$ is the signal waveform, $Z(t)$ is the waveform of the white
Gaussian noise, and $h(t)$ is the impulse response of an ideal bandpass
filter, which cuts out all frequencies greater than $W$. In this section we
give simplified arguments to calculate the capacity of such a channel.

We begin with a representation theorem due to Nyquist [396] and
Shannon [480], which shows that sampling a bandlimited signal every
$\frac{1}{2W}$ seconds (that is, at a rate of $2W$ samples per second) is sufficient
to reconstruct the signal from the samples. Intuitively,
this is due to the fact that if a signal is bandlimited to $W$, it cannot change
by a substantial amount in a time less than half a cycle of the maximum
frequency in the signal, that is, the signal cannot change very much in
time intervals less than $\frac{1}{2W}$ seconds.

Theorem 9.3.1 Suppose that a function $f(t)$ is bandlimited to $W$,
namely, the spectrum of the function is 0 for all frequencies greater than
$W$. Then the function is completely determined by samples of the function
spaced $\frac{1}{2W}$ seconds apart.

Proof: Let $F(\omega)$ be the Fourier transform of $f(t)$. Then

$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega) e^{i\omega t}\, d\omega \tag{9.55}$$

$$= \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} F(\omega) e^{i\omega t}\, d\omega, \tag{9.56}$$

since $F(\omega)$ is zero outside the band $-2\pi W \le \omega \le 2\pi W$. If we consider
samples spaced $\frac{1}{2W}$ seconds apart, the value of the signal at the sample
points can be written

$$f\left(\frac{n}{2W}\right) = \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} F(\omega) e^{i\omega \frac{n}{2W}}\, d\omega. \tag{9.57}$$

The right-hand side of this equation is also the definition of the coefficients
of the Fourier series expansion of the periodic extension of the function
$F(\omega)$, taking the interval $-2\pi W$ to $2\pi W$ as the fundamental period. Thus,
the sample values $f\left(\frac{n}{2W}\right)$ determine the Fourier coefficients and, by
extension, they determine the value of $F(\omega)$ in the interval $(-2\pi W, 2\pi W)$.
Since a function is uniquely specified by its Fourier transform, and since
$F(\omega)$ is zero outside the band $W$, we can determine the function uniquely
from the samples.

Consider the function

$$\operatorname{sinc}(t) = \frac{\sin(2\pi W t)}{2\pi W t}. \tag{9.58}$$

This function is 1 at $t = 0$ and is 0 for $t = n/2W$, $n \ne 0$. The spectrum
of this function is constant in the band $(-W, W)$ and is zero outside this
band. Now define

$$g(t) = \sum_{n=-\infty}^{\infty} f\left(\frac{n}{2W}\right) \operatorname{sinc}\left(t - \frac{n}{2W}\right). \tag{9.59}$$

From the properties of the sinc function, it follows that $g(t)$ is
bandlimited to $W$ and is equal to $f(n/2W)$ at $t = n/2W$. Since there is only
one function satisfying these constraints, we must have $g(t) = f(t)$. This
provides an explicit representation of $f(t)$ in terms of its samples. □
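The interpolation formula (9.59) can be tried numerically. The sketch below is an illustration under stated assumptions: it truncates the infinite sum to $|n| \le 200$, takes $W = 1$, and uses the test signal $f(t) = \operatorname{sinc}(t - \tau)$ with $\tau = 0.3$, which is itself bandlimited to $W$. The names `reconstruct`, `samples`, and `tau` are hypothetical.

```python
import math

def sinc(t, W):
    """sinc(t) = sin(2*pi*W*t) / (2*pi*W*t), equal to 1 at t = 0 (eq. 9.58)."""
    x = 2.0 * math.pi * W * t
    return 1.0 if x == 0.0 else math.sin(x) / x

def reconstruct(samples, W, t):
    """Truncated form of (9.59): sum of f(n/2W) * sinc(t - n/2W) over stored n."""
    return sum(f_n * sinc(t - n / (2.0 * W), W) for n, f_n in samples.items())

# Assumed test setup: bandwidth W = 1, signal f(t) = sinc(t - tau).
W, tau, K = 1.0, 0.3, 200

def f(t):
    return sinc(t - tau, W)

# Samples taken 1/(2W) seconds apart, for n = -K, ..., K.
samples = {n: f(n / (2.0 * W)) for n in range(-K, K + 1)}

t_query = 0.1                      # a point that is not on the sampling grid
approx = reconstruct(samples, W, t_query)
exact = f(t_query)
print(f"reconstructed: {approx:.5f}, exact: {exact:.5f}")
```

The small residual comes only from truncating the sum; the terms decay roughly like $1/n^2$ for this test signal, so a few hundred samples already give close agreement.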

A general function has an infinite number of degrees of freedom — the 
value of the function at every point can be chosen independently. The 
Nyquist-Shannon sampling theorem shows that a bandlimited function 
has only 2W degrees of freedom per second. The values of the function 
at the sample points can be chosen independently, and this specifies the 
entire function. 

If a function is bandlimited, it cannot be limited in time. But we can 
consider functions that have most of their energy in bandwidth W and 
have most of their energy in a finite time interval, say (0, T). We can 
describe these functions using a basis of prolate spheroidal functions. We 
do not go into the details of this theory here; it suffices to say that there 
are about 2TW orthonormal basis functions for the set of almost time- 
limited, almost bandlimited functions, and we can describe any function 
within the set by its coordinates in this basis. The details can be found 
in a series of papers by Landau, Pollak, and Slepian [340, 341, 500]. 
Moreover, the projection of white noise on these basis vectors forms 
an i.i.d. Gaussian process. The above arguments enable us to view the 
bandlimited, time-limited functions as vectors in a vector space of 2TW 
dimensions. 

Now we return to the problem of communication over a bandlimited 
channel. Assuming that the channel has bandwidth W, we can represent 
both the input and the output by samples taken 1/2W seconds apart. Each 
of the input samples is corrupted by noise to produce the corresponding 
output sample. Since the noise is white and Gaussian, it can be shown 
that each noise sample is an independent, identically distributed Gaussian 
random variable. 

If the noise has power spectral density $N_0/2$ watts/hertz and bandwidth
$W$ hertz, the noise has power $\frac{N_0}{2} 2W = N_0 W$ and each of the $2WT$ noise
samples in time $T$ has variance $N_0 W T / 2WT = N_0/2$. Looking at the
input as a vector in the $2TW$-dimensional space, we see that the received
signal is spherically normally distributed about this point with covariance
$\frac{N_0}{2} I$.

Now we can use the theory derived earlier for discrete-time Gaussian
channels, where it was shown that the capacity of such a channel is

$$C = \frac{1}{2}\log\left(1 + \frac{P}{N}\right) \text{ bits per transmission.} \tag{9.60}$$

Let the channel be used over the time interval $[0, T]$. In this case, the
energy per sample is $PT/2WT = P/2W$, the noise variance per sample
is $\frac{N_0}{2} 2W \frac{T}{2WT} = N_0/2$, and hence the capacity per sample is

$$C = \frac{1}{2}\log\left(1 + \frac{P/2W}{N_0/2}\right) = \frac{1}{2}\log\left(1 + \frac{P}{N_0 W}\right) \text{ bits per sample.} \tag{9.61}$$

Since there are $2W$ samples each second, the capacity of the channel can
be rewritten as

$$C = W \log\left(1 + \frac{P}{N_0 W}\right) \text{ bits per second.} \tag{9.62}$$

This equation is one of the most famous formulas of information theory. It
gives the capacity of a bandlimited Gaussian channel with noise spectral
density $N_0/2$ watts/Hz and power $P$ watts.

A more precise version of the capacity argument [576] involves con¬ 
sideration of signals with a small fraction of their energy outside the 
bandwidth W of the channel and a small fraction of their energy outside 
the time interval (0, T). The capacity above is then obtained as a limit as 
the fraction of energy outside the band goes to zero. 

If we let $W \to \infty$ in (9.62), we obtain

$$C = \frac{P}{N_0}\log_2 e \text{ bits per second} \tag{9.63}$$

as the capacity of a channel with an infinite bandwidth, power $P$, and
noise spectral density $N_0/2$. Thus, for infinite bandwidth channels, the
capacity grows linearly with the power.
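The limit can be spelled out in one line. Writing $x = P/(N_0 W)$, so that $x \to 0$ as $W \to \infty$, and taking logarithms to base 2, a short derivation (a sketch that uses only the standard limit $\ln(1 + x)/x \to 1$) is:

```latex
\begin{align*}
\lim_{W \to \infty} W \log_2\!\left(1 + \frac{P}{N_0 W}\right)
  &= \lim_{x \to 0} \frac{P}{N_0} \cdot \frac{\log_2(1 + x)}{x} \\
  &= \frac{P}{N_0} \log_2 e \cdot \lim_{x \to 0} \frac{\ln(1 + x)}{x} \\
  &= \frac{P}{N_0} \log_2 e .
\end{align*}
```

Since $\log_2(1 + x) \le x \log_2 e$ for all $x \ge 0$, the finite-bandwidth capacity (9.62) never exceeds this limiting value.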

Example 9.3.1 (Telephone line) To allow multiplexing of many chan¬ 
nels, telephone signals are bandlimited to 3300 Hz. Using a bandwidth of 
3300 Hz and an SNR (signal-to-noise ratio) of 33 dB (i.e., $P/N_0 W \approx 2000$)
in (9.62), we find the capacity of the telephone channel to be 
about 36,000 bits per second. Practical modems achieve transmission rates 
up to 33,600 bits per second in both directions over a telephone channel. 
In real telephone channels, there are other factors, such as crosstalk, inter¬ 
ference, echoes, and nonflat channels which must be compensated for to 
achieve this capacity. 
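The arithmetic of this example is easy to reproduce. A minimal sketch, where the function name `bandlimited_capacity` is a hypothetical helper that evaluates (9.62) with the logarithm taken to base 2:

```python
import math

def bandlimited_capacity(bandwidth_hz, snr):
    """C = W * log2(1 + P/(N0*W)) in bits per second, as in (9.62)."""
    return bandwidth_hz * math.log2(1.0 + snr)

W = 3300.0                      # bandwidth in Hz, from the example
snr_db = 33.0                   # signal-to-noise ratio in dB, from the example
snr = 10.0 ** (snr_db / 10.0)   # linear ratio, about 1995 (rounded to 2000 in the text)

C = bandlimited_capacity(W, snr)
print(f"linear SNR: {snr:.1f}")
print(f"capacity  : {C:.0f} bits per second")
```

The result is a little over 36,000 bits per second, in line with the figure quoted above.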

The V.90 modems that achieve 56 kb/s over the telephone channel 
achieve this rate in only one direction, taking advantage of a purely digital 
channel from the server to final telephone switch in the network. In this 
case, the only impairments are due to the digital-to-analog conversion at 
this switch and the noise in the copper link from the switch to the home; 
these impairments reduce the maximum bit rate from the 64 kb/s for the 
digital signal in the network to the 56 kb/s in the best of telephone lines. 

The actual bandwidth available on the copper wire that links a home to 
a telephone switch is on the order of a few megahertz; it depends on the 
length of the wire. The frequency response is far from flat over this band. 
If the entire bandwidth is used, it is possible to send a few megabits per 
second through this channel; schemes such as DSL (Digital Subscriber 
Line) achieve this using special equipment at both ends of the telephone 
line (unlike modems, which do not require modification at the telephone 
switch). 

9.4 PARALLEL GAUSSIAN CHANNELS 

In this section we consider k independent Gaussian channels in parallel 
with a common power constraint. The objective is to distribute the total 
power among the channels so as to maximize the capacity. This channel 
models a nonwhite additive Gaussian noise channel where each parallel 
component represents a different frequency. 

Assume that we have a set of Gaussian channels in parallel as illustrated 
in Figure 9.3. The output of each channel is the sum of the input and 
Gaussian noise. For channel $j$,

$$Y_j = X_j + Z_j, \quad j = 1, 2, \ldots, k, \tag{9.64}$$

with

$$Z_j \sim \mathcal{N}(0, N_j), \tag{9.65}$$

and the noise is assumed to be independent from channel to channel. We
assume that there is a common power constraint on the total power used,
that is,

$$E \sum_{j=1}^{k} X_j^2 \le P. \tag{9.66}$$

We wish to distribute the power among the various channels so as to
maximize the total capacity.

The information capacity of the channel $C$ is

$$C = \max_{f(x_1, x_2, \ldots, x_k):\, \sum E X_i^2 \le P} I(X_1, X_2, \ldots, X_k; Y_1, Y_2, \ldots, Y_k). \tag{9.67}$$

[Figure 9.3: parallel Gaussian channels.]
We calculate the distribution that achieves the information capacity for 
this channel. The fact that the information capacity is the supremum of 
achievable rates can be proved by methods identical to those in the proof 
of the capacity theorem for single Gaussian channels and will be omitted. 

Since $Z_1, Z_2, \ldots, Z_k$ are independent,

$$I(X_1, X_2, \ldots, X_k; Y_1, Y_2, \ldots, Y_k)$$

$$= h(Y_1, Y_2, \ldots, Y_k) - h(Y_1, Y_2, \ldots, Y_k \mid X_1, X_2, \ldots, X_k)$$

$$= h(Y_1, Y_2, \ldots, Y_k) - h(Z_1, Z_2, \ldots, Z_k \mid X_1, X_2, \ldots, X_k)$$

$$= h(Y_1, Y_2, \ldots, Y_k) - h(Z_1, Z_2, \ldots, Z_k) \tag{9.68}$$

$$= h(Y_1, Y_2, \ldots, Y_k) - \sum_i h(Z_i) \tag{9.69}$$

$$\le \sum_i \left(h(Y_i) - h(Z_i)\right) \tag{9.70}$$

$$\le \sum_i \frac{1}{2}\log\left(1 + \frac{P_i}{N_i}\right), \tag{9.71}$$

where $P_i = E X_i^2$, and $\sum P_i = P$. Equality is achieved by

$$(X_1, X_2, \ldots, X_k) \sim \mathcal{N}\left(0, \begin{bmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_k \end{bmatrix}\right). \tag{9.72}$$

So the problem is reduced to finding the power allotment that
maximizes the capacity subject to the constraint that $\sum P_i = P$. This is a
standard optimization problem and can be solved using Lagrange
multipliers. Writing the functional as

$$J(P_1, \ldots, P_k) = \sum \frac{1}{2}\log\left(1 + \frac{P_i}{N_i}\right) + \lambda\left(\sum P_i\right) \tag{9.73}$$

and differentiating with respect to $P_i$, we have

$$\frac{1}{2}\,\frac{1}{P_i + N_i} + \lambda = 0 \tag{9.74}$$

or

$$P_i = \nu - N_i. \tag{9.75}$$

However, since the $P_i$'s must be nonnegative, it may not always be
possible to find a solution of this form. In this case, we use the Kuhn-Tucker
conditions to verify that the solution

$$P_i = (\nu - N_i)^+ \tag{9.76}$$

is the assignment that maximizes capacity, where $\nu$ is chosen so that

$$\sum (\nu - N_i)^+ = P. \tag{9.77}$$

Here $(x)^+$ denotes the positive part of $x$:

$$(x)^+ = \begin{cases} x & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases} \tag{9.78}$$

This solution is illustrated graphically in Figure 9.4. The vertical levels
indicate the noise levels in the various channels. As the signal power is
increased from zero, we allot the power to the channels with the lowest
noise. When the available power is increased still further, some of the
power is put into noisier channels. The process by which the power is
distributed among the various bins is identical to the way in which water
distributes itself in a vessel, hence this process is sometimes referred to
as water-filling.

FIGURE 9.4. Water-filling for parallel channels.
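The water level $\nu$ in (9.77) has no closed form in general, but it is easy to find numerically because the total allotted power is nondecreasing in $\nu$. The following is a minimal sketch using bisection; the function name `water_fill`, the tolerance, and the example noise levels are assumptions made for illustration.

```python
import math

def water_fill(noise, total_power, tol=1e-12):
    """Return (nu, powers) with powers[i] = max(nu - noise[i], 0), summing to total_power.

    Bisection on the water level nu: the allotted power is 0 at nu = min(noise)
    and at least total_power at nu = max(noise) + total_power.
    """
    lo, hi = min(noise), max(noise) + total_power
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        used = sum(max(nu - n_i, 0.0) for n_i in noise)
        if used > total_power:
            hi = nu      # water level too high
        else:
            lo = nu      # water level too low (or exactly right)
    nu = 0.5 * (lo + hi)
    return nu, [max(nu - n_i, 0.0) for n_i in noise]

# Hypothetical example: three channels with noise levels 1, 2, 5 and P = 3.
noise_levels = [1.0, 2.0, 5.0]
nu, powers = water_fill(noise_levels, 3.0)

# Total capacity: sum of 0.5*log2(1 + P_i/N_i), as in (9.71).
capacity = sum(0.5 * math.log2(1.0 + p / n_i)
               for p, n_i in zip(powers, noise_levels))
print(f"water level nu = {nu:.6f}")
print(f"powers         = {[round(p, 6) for p in powers]}")
print(f"capacity       = {capacity:.4f} bits per transmission")
```

In this example the water level settles at $\nu = 3$, so the two quietest channels receive powers 2 and 1, while the channel with noise level 5 lies above the water and receives nothing.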

9.5 CHANNELS WITH COLORED GAUSSIAN NOISE 

In Section 9.4, we considered the case of a set of parallel independent 
Gaussian channels in which the noise samples from different channels 
were independent. Now we will consider the case when the noise is depen¬ 
dent. This represents not only the case of parallel channels, but also the 
case when the channel has Gaussian noise with memory. For channels 
with memory, we can consider a block of n consecutive uses of the chan¬ 
nel as n channels in parallel with dependent noise. As in Section 9.4, we 
will calculate only the information capacity for this channel. 

Let $K_Z$ be the covariance matrix of the noise, and let $K_X$ be the input
covariance matrix. The power constraint on the input can then be
written as

$$\frac{1}{n}\sum_i E X_i^2 \le P, \tag{9.79}$$

or equivalently,

$$\frac{1}{n}\operatorname{tr}(K_X) \le P. \tag{9.80}$$

Unlike Section 9.4, the power constraint here depends on $n$; the capacity
will have to be calculated for each $n$.

Just as in the case of independent channels, we can write

$$I(X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n) = h(Y_1, Y_2, \ldots, Y_n) - h(Z_1, Z_2, \ldots, Z_n). \tag{9.81}$$

Here $h(Z_1, Z_2, \ldots, Z_n)$ is determined only by the distribution of the noise
and is not dependent on the choice of input distribution. So finding the
capacity amounts to maximizing $h(Y_1, Y_2, \ldots, Y_n)$. The entropy of the
output is maximized when $Y$ is normal, which is achieved when the input
is normal. Since the input and the noise are independent, the covariance
of the output $Y$ is $K_Y = K_X + K_Z$ and the entropy is

$$h(Y_1, Y_2, \ldots, Y_n) = \frac{1}{2}\log\left((2\pi e)^n |K_X + K_Z|\right). \tag{9.82}$$


Now the problem is reduced to choosing $K_X$ so as to maximize $|K_X + K_Z|$,
subject to a trace constraint on $K_X$. To do this, we decompose $K_Z$
into its diagonal form,

$$K_Z = Q \Lambda Q^t, \quad \text{where } Q Q^t = I. \tag{9.83}$$

Then

$$|K_X + K_Z| = |K_X + Q \Lambda Q^t| \tag{9.84}$$

$$= |Q|\,|Q^t K_X Q + \Lambda|\,|Q^t| \tag{9.85}$$

$$= |Q^t K_X Q + \Lambda| \tag{9.86}$$

$$= |A + \Lambda|, \tag{9.87}$$

where $A = Q^t K_X Q$. Since for any matrices $B$ and $C$,

$$\operatorname{tr}(BC) = \operatorname{tr}(CB), \tag{9.88}$$

we have

$$\operatorname{tr}(A) = \operatorname{tr}(Q^t K_X Q) \tag{9.89}$$

$$= \operatorname{tr}(Q Q^t K_X) \tag{9.90}$$

$$= \operatorname{tr}(K_X). \tag{9.91}$$

Now the problem is reduced to maximizing $|A + \Lambda|$ subject to a trace
constraint $\operatorname{tr}(A) \le nP$.

Now we apply Hadamard's inequality, mentioned in Chapter 8.
Hadamard's inequality states that the determinant of any positive definite matrix
$K$ is at most the product of its diagonal elements, that is,

$$|K| \le \prod_i K_{ii} \tag{9.92}$$

with equality iff the matrix is diagonal. Thus,

$$|A + \Lambda| \le \prod_i (A_{ii} + \lambda_i) \tag{9.93}$$

with equality iff $A$ is diagonal. Since $A$ is subject to a trace constraint,

$$\frac{1}{n}\sum_i A_{ii} \le P, \tag{9.94}$$

and $A_{ii} \ge 0$, the maximum value of $\prod_i (A_{ii} + \lambda_i)$ is attained when

$$A_{ii} + \lambda_i = \nu. \tag{9.95}$$

However, given the constraints, it may not always be possible to satisfy
this equation with positive $A_{ii}$. In such cases, we can show by the standard
Kuhn-Tucker conditions that the optimum solution corresponds to setting

$$A_{ii} = (\nu - \lambda_i)^+, \tag{9.96}$$

where the water level $\nu$ is chosen so that $\sum A_{ii} = nP$. This value of $A$
maximizes the entropy of $Y$ and hence the mutual information. We can
use Figure 9.4 to see the connection between the methods described above
and water-filling.
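A two-dimensional case makes the recipe concrete. The sketch below assumes a hypothetical noise covariance $K_Z$ with unit variances and correlation $0.8$, block length $n = 2$, and $P = 0.5$; it computes the eigenvalues in closed form (valid only for symmetric $2 \times 2$ matrices), finds the water level by checking how many of the smallest eigenvalues lie under water, and evaluates $\frac{1}{2n}\sum_i \log\left(1 + A_{ii}/\lambda_i\right)$. All function names are illustrative.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return [mean - radius, mean + radius]

def water_level(eigs, budget):
    """Water level nu with sum(max(nu - lam, 0)) == budget.

    Tries k = n, n-1, ..., 1 active (smallest) eigenvalues and returns the
    first level that lies above the largest active eigenvalue.
    """
    lams = sorted(eigs)
    for k in range(len(lams), 0, -1):
        nu = (budget + sum(lams[:k])) / k
        if nu > lams[k - 1]:
            return nu
    raise ValueError("budget must be positive")

# Hypothetical example: K_Z = [[1, 0.8], [0.8, 1]], n = 2, P = 0.5.
n, P = 2, 0.5
eigs = eig_sym_2x2(1.0, 0.8, 1.0)                 # about [0.2, 1.8]
nu = water_level(eigs, n * P)                     # trace budget is n * P
A_diag = [max(nu - lam, 0.0) for lam in eigs]     # A_ii = (nu - lambda_i)^+
C_n = sum(math.log2(1.0 + a / lam)
          for a, lam in zip(A_diag, eigs)) / (2 * n)

print(f"eigenvalues : {[round(e, 4) for e in eigs]}")
print(f"water level : {nu:.4f}")
print(f"A_ii        : {[round(a, 4) for a in A_diag]}")
print(f"C_n         : {C_n:.4f} bits per transmission")
```

Here all of the power goes into the eigen-direction with noise variance $0.2$. The resulting rate exceeds $\frac{1}{2}\log_2(1.5)$, the capacity for white noise of the same per-sample variance, which shows how the input exploits the noise correlation.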

Consider a channel in which the additive Gaussian noise is a stochastic
process with finite-dimensional covariance matrix $K_Z^{(n)}$. If the process
is stationary, the covariance matrix is Toeplitz and the density of
eigenvalues on the real line tends to the power spectrum of the stochastic
process [262]. In this case, the above water-filling argument translates to
water-filling in the spectral domain.

Hence, for channels in which the noise forms a stationary stochastic
process, the input signal should be chosen to be a Gaussian process with
a spectrum that is large at frequencies where the noise spectrum is small.
This is illustrated in Figure 9.5. The capacity of an additive Gaussian
noise channel with noise power spectrum $N(f)$ can be shown to be [233]

$$C = \int \frac{1}{2}\log\left(1 + \frac{(\nu - N(f))^+}{N(f)}\right) df, \tag{9.97}$$

where $\nu$ is chosen so that $\int (\nu - N(f))^+\, df = P$.

[Figure 9.5: water-filling in the spectral domain.]


9.6 GAUSSIAN CHANNELS WITH FEEDBACK 

In Chapter 7 we proved that feedback does not increase the capacity for 
discrete memoryless channels, although it can help greatly in reducing 
the complexity of encoding or decoding. The same is true of an additive 
noise channel with white noise. As in the discrete case, feedback does not 
increase capacity for memoryless Gaussian channels. 

However, for channels with memory, where the noise is correlated 
from time instant to time instant, feedback does increase capacity. The 
capacity without feedback can be calculated using water-filling, but we do 
not have a simple explicit characterization of the capacity with feedback. 
In this section we describe an expression for the capacity in terms of the 
covariance matrix of the noise Z. We prove a converse for this expression 
for capacity. We then derive a simple bound on the increase in capacity 
due to feedback. 

The Gaussian channel with feedback is illustrated in Figure 9.6. The
output of the channel $Y_i$ is

$$Y_i = X_i + Z_i, \quad Z_i \sim \mathcal{N}(0, K_Z^{(n)}). \tag{9.98}$$

FIGURE 9.6. Gaussian channel with feedback.


The feedback allows the input of the channel to depend on the past values 
of the output. 

A $(2^{nR}, n)$ code for the Gaussian channel with feedback consists of
a sequence of mappings $x_i(W, Y^{i-1})$, where $W \in \{1, 2, \ldots, 2^{nR}\}$ is the
input message and $Y^{i-1}$ is the sequence of past values of the output. Thus,
$x(W, \cdot)$ is a code function rather than a codeword. In addition, we require
that the code satisfy a power constraint,

$$E\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2(w, Y^{i-1})\right] \le P, \quad w \in \{1, 2, \ldots, 2^{nR}\}, \tag{9.99}$$

where the expectation is over all possible noise sequences.

We characterize the capacity of the Gaussian channel in terms of the
covariance matrices of the input $X$ and the noise $Z$. Because of the
feedback, $X^n$ and $Z^n$ are not independent; $X_i$ depends causally on the past
values of $Z$. In the next section we prove a converse for the Gaussian
channel with feedback and show that we achieve capacity if we take $X$
to be Gaussian.

We now state an informal characterization of the capacity of the channel 
with and without feedback. 

1. With feedback. The capacity $C_{n,FB}$ in bits per transmission of the
time-varying Gaussian channel with feedback is

$$C_{n,FB} = \max_{\frac{1}{n}\operatorname{tr}(K_X^{(n)}) \le P} \frac{1}{2n}\log\frac{|K_{X+Z}^{(n)}|}{|K_Z^{(n)}|}, \tag{9.100}$$

where the maximization is taken over all $X^n$ of the form

$$X_i = \sum_{j=1}^{i-1} b_{ij} Z_j + V_i, \quad i = 1, 2, \ldots, n, \tag{9.101}$$

and $V^n$ is independent of $Z^n$. To verify that the maximization over
(9.101) involves no loss of generality, note that the distribution on
$X^n + Z^n$ achieving the maximum entropy is Gaussian. Since $Z^n$ is
also Gaussian, it can be verified that a jointly Gaussian distribution
on $(X^n, Z^n, X^n + Z^n)$ achieves the maximization in (9.100).
But since $Z^n = Y^n - X^n$, the most general jointly normal causal
dependence of $X^n$ on $Y^n$ is of the form (9.101), where $V^n$ plays the
role of the innovations process. Recasting (9.100) and (9.101) using
$X = BZ + V$ and $Y = X + Z$, we can write

$$C_{n,FB} = \max \frac{1}{2n}\log\frac{|(B + I) K_Z^{(n)} (B + I)^t + K_V|}{|K_Z^{(n)}|}, \tag{9.102}$$

where the maximum is taken over all nonnegative definite $K_V$ and
strictly lower triangular $B$ such that

$$\operatorname{tr}(B K_Z^{(n)} B^t + K_V) \le nP. \tag{9.103}$$

Note that $B$ is 0 if feedback is not allowed.

2. Without feedback. The capacity $C_n$ of the time-varying Gaussian
channel without feedback is given by

$$C_n = \max_{\frac{1}{n}\operatorname{tr}(K_X^{(n)}) \le P} \frac{1}{2n}\log\frac{|K_X^{(n)} + K_Z^{(n)}|}{|K_Z^{(n)}|}. \tag{9.104}$$

This reduces to water-filling on the eigenvalues $\{\lambda_i^{(n)}\}$ of $K_Z^{(n)}$. Thus,

$$C_n = \frac{1}{2n}\sum_{i=1}^{n} \log\left(1 + \frac{(\lambda - \lambda_i^{(n)})^+}{\lambda_i^{(n)}}\right), \tag{9.105}$$

where $(y)^+ = \max\{y, 0\}$ and where $\lambda$ is chosen so that

$$\sum_{i=1}^{n} (\lambda - \lambda_i^{(n)})^+ = nP. \tag{9.106}$$

We now prove an upper bound for the capacity of the Gaussian channel
with feedback. This bound is actually achievable [136], and is therefore
the capacity, but we do not prove this here.

Theorem 9.6.1 For a Gaussian channel with feedback, the rate $R_n$ for
any sequence of $(2^{nR_n}, n)$ codes with $P_e^{(n)} \to 0$ satisfies

$$R_n \le C_{n,FB} + \epsilon_n, \tag{9.107}$$

with $\epsilon_n \to 0$ as $n \to \infty$, where $C_{n,FB}$ is defined in (9.100).

Proof: Let $W$ be uniform over $2^{nR}$, and therefore the probability of error
is bounded by Fano's inequality,

$$H(W \mid \hat{W}) \le 1 + nR_n P_e^{(n)} = n\epsilon_n, \tag{9.108}$$

where $\epsilon_n \to 0$ as $P_e^{(n)} \to 0$. We can then bound the rate as follows:

$$nR_n = H(W) \tag{9.109}$$

$$= I(W; \hat{W}) + H(W \mid \hat{W}) \tag{9.110}$$

$$\le I(W; \hat{W}) + n\epsilon_n \tag{9.111}$$

$$\le I(W; Y^n) + n\epsilon_n \tag{9.112}$$

$$= \sum_i I(W; Y_i \mid Y^{i-1}) + n\epsilon_n \tag{9.113}$$

$$\overset{(a)}{=} \sum_i \left(h(Y_i \mid Y^{i-1}) - h(Y_i \mid W, Y^{i-1}, X_i, X^{i-1}, Z^{i-1})\right) + n\epsilon_n \tag{9.114}$$

$$\overset{(b)}{=} \sum_i \left(h(Y_i \mid Y^{i-1}) - h(Z_i \mid W, Y^{i-1}, X_i, X^{i-1}, Z^{i-1})\right) + n\epsilon_n \tag{9.115}$$

$$\overset{(c)}{=} \sum_i \left(h(Y_i \mid Y^{i-1}) - h(Z_i \mid Z^{i-1})\right) + n\epsilon_n \tag{9.116}$$

$$= h(Y^n) - h(Z^n) + n\epsilon_n, \tag{9.117}$$

where (a) follows from the fact that $X_i$ is a function of $W$ and the past
$Y_i$'s, and $Z^{i-1}$ is $Y^{i-1} - X^{i-1}$, (b) follows from $Y_i = X_i + Z_i$ and the
fact that $h(X + Z \mid X) = h(Z \mid X)$, and (c) follows from the fact that $Z_i$ and
$(W, Y^{i-1}, X^i)$ are conditionally independent given $Z^{i-1}$. Continuing the
chain of inequalities after dividing by $n$, we have

$$R_n \le \frac{1}{n}\left(h(Y^n) - h(Z^n)\right) + \epsilon_n \tag{9.118}$$

$$\le \frac{1}{2n}\log\frac{|K_Y^{(n)}|}{|K_Z^{(n)}|} + \epsilon_n \tag{9.119}$$

$$\le C_{n,FB} + \epsilon_n, \tag{9.120}$$

by the entropy maximizing property of the normal. □


We have proved an upper bound on the capacity of the Gaussian channel
with feedback in terms of the covariance matrix $K_{X+Z}^{(n)}$. We now derive
bounds on the capacity with feedback in terms of $K_X^{(n)}$ and $K_Z^{(n)}$, which
will then be used to derive bounds in terms of the capacity without
feedback. For simplicity of notation, we will drop the superscript $n$ in the
symbols for covariance matrices.

We first prove a series of lemmas about matrices and determinants.

Lemma 9.6.1 Let $X$ and $Z$ be $n$-dimensional random vectors. Then

$$K_{X+Z} + K_{X-Z} = 2K_X + 2K_Z. \tag{9.121}$$

Proof:

$$K_{X+Z} = E(X + Z)(X + Z)^t \tag{9.122}$$

$$= E X X^t + E X Z^t + E Z X^t + E Z Z^t \tag{9.123}$$

$$= K_X + K_{XZ} + K_{ZX} + K_Z. \tag{9.124}$$

Similarly,

$$K_{X-Z} = K_X - K_{XZ} - K_{ZX} + K_Z. \tag{9.125}$$

Adding these two equations completes the proof. □

Lemma 9.6.2 For two $n \times n$ nonnegative definite matrices $A$ and $B$, if
$A - B$ is nonnegative definite, then $|A| \ge |B|$.

Proof: Let $C = A - B$. Since $B$ and $C$ are nonnegative definite, we
can consider them as covariance matrices. Consider two independent
normal random vectors $X_1 \sim \mathcal{N}(0, B)$ and $X_2 \sim \mathcal{N}(0, C)$. Let $Y = X_1 + X_2$.
Then

$$h(Y) \ge h(Y \mid X_2) \tag{9.126}$$

$$= h(X_1 \mid X_2) \tag{9.127}$$

$$= h(X_1), \tag{9.128}$$

where the inequality follows from the fact that conditioning reduces
differential entropy, and the final equality from the fact that $X_1$ and $X_2$ are
independent. Substituting the expressions for the differential entropies of
a normal random variable, we obtain

$$\frac{1}{2}\log (2\pi e)^n |A| \ge \frac{1}{2}\log (2\pi e)^n |B|, \tag{9.129}$$

which is equivalent to the desired lemma. □

Lemma 9.6.3 For two $n$-dimensional random vectors $X$ and $Z$,

$$|K_{X+Z}| \le 2^n |K_X + K_Z|. \tag{9.130}$$

Proof: From Lemma 9.6.1,

$$2(K_X + K_Z) - K_{X+Z} = K_{X-Z} \succeq 0, \tag{9.131}$$

where $A \succeq 0$ means that $A$ is nonnegative definite. Hence, applying
Lemma 9.6.2, we have

$$|K_{X+Z}| \le |2(K_X + K_Z)| = 2^n |K_X + K_Z|, \tag{9.132}$$

which is the desired result. □

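Lemma 9.6.4 For $A$, $B$ nonnegative definite matrices and $0 \le \lambda \le 1$,

$$|\lambda A + (1 - \lambda) B| \ge |A|^{\lambda} |B|^{1 - \lambda}. \tag{9.133}$$

Proof: Let $X \sim \mathcal{N}_n(0, A)$ and $Y \sim \mathcal{N}_n(0, B)$. Let $Z$ be the mixture
random vector

$$Z = \begin{cases} X & \text{if } \theta = 1, \\ Y & \text{if } \theta = 2, \end{cases} \tag{9.134}$$

where

$$\theta = \begin{cases} 1 & \text{with probability } \lambda, \\ 2 & \text{with probability } 1 - \lambda. \end{cases} \tag{9.135}$$

Let $X$, $Y$, and $\theta$ be independent. Then

$$K_Z = \lambda A + (1 - \lambda) B. \tag{9.136}$$

We observe that

$$\frac{1}{2}\ln (2\pi e)^n |\lambda A + (1 - \lambda) B| \ge h(Z) \tag{9.137}$$

$$\ge h(Z \mid \theta) \tag{9.138}$$

$$= \lambda h(X) + (1 - \lambda) h(Y) \tag{9.139}$$

$$= \frac{1}{2}\ln (2\pi e)^n |A|^{\lambda} |B|^{1 - \lambda}, \tag{9.140}$$

which proves the result. The first inequality follows from the entropy
maximizing property of the Gaussian under the covariance constraint. □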
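Lemma 9.6.4 can be sanity-checked on small matrices. The sketch below assumes two hypothetical $2 \times 2$ positive definite matrices $A$ and $B$ (chosen only for illustration) and compares $|\lambda A + (1 - \lambda)B|$ with $|A|^{\lambda}|B|^{1 - \lambda}$ on a grid of $\lambda$ values. A numerical check of this kind illustrates the inequality; it does not replace the proof.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def convex_combination(lam, A, B):
    """Entrywise lam * A + (1 - lam) * B for 2x2 matrices."""
    return [[lam * A[i][j] + (1.0 - lam) * B[i][j] for j in range(2)]
            for i in range(2)]

# Hypothetical positive definite matrices (illustrative values only).
A = [[2.0, 0.5], [0.5, 1.0]]     # determinant 1.75
B = [[1.0, -0.3], [-0.3, 3.0]]   # determinant 2.91

results = []
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    lhs = det2(convex_combination(lam, A, B))        # |lam*A + (1-lam)*B|
    rhs = det2(A) ** lam * det2(B) ** (1.0 - lam)    # |A|^lam * |B|^(1-lam)
    results.append((lam, lhs, rhs))
    print(f"lambda={lam:.2f}  lhs={lhs:.4f}  rhs={rhs:.4f}")
```

The two sides agree at $\lambda = 0$ and $\lambda = 1$ and the left side is strictly larger in between, which is the log-concavity of the determinant that the lemma expresses.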

Definition We say that a random vector $X^n$ is causally related to $Z^n$ if

$$f(x^n, z^n) = f(z^n) \prod_{i=1}^{n} f(x_i \mid x^{i-1}, z^{i-1}). \tag{9.141}$$

Note that the feedback codes necessarily yield causally related $(X^n, Z^n)$.

Lemma 9.6.5 If $X^n$ and $Z^n$ are causally related, then

$$h(X^n - Z^n) \ge h(Z^n) \tag{9.142}$$

and

$$|K_{X-Z}| \ge |K_Z|, \tag{9.143}$$

where $K_{X-Z}$ and $K_Z$ are the covariance matrices of $X^n - Z^n$ and $Z^n$,
respectively.

Proof: We have

$$h(X^n - Z^n) \overset{(a)}{=} \sum_{i=1}^{n} h(X_i - Z_i \mid X^{i-1} - Z^{i-1}) \tag{9.144}$$

$$\overset{(b)}{\ge} \sum_{i=1}^{n} h(X_i - Z_i \mid X^{i-1}, Z^{i-1}, X_i) \tag{9.145}$$

$$\overset{(c)}{=} \sum_{i=1}^{n} h(Z_i \mid X^{i-1}, Z^{i-1}, X_i) \tag{9.146}$$

$$\overset{(d)}{=} \sum_{i=1}^{n} h(Z_i \mid Z^{i-1}) \tag{9.147}$$

$$\overset{(e)}{=} h(Z^n). \tag{9.148}$$

Here (a) follows from the chain rule, (b) follows from conditioning
$h(A \mid B) \ge h(A \mid B, C)$, (c) follows from the conditional determinism of
$X_i$ and the invariance of differential entropy under translation, (d)
follows from the causal relationship of $X^n$ and $Z^n$, and (e) follows from the
chain rule.

Finally, suppose that $X^n$ and $Z^n$ are causally related and the
associated covariance matrices for $Z^n$ and $X^n - Z^n$ are $K_Z$ and $K_{X-Z}$. There
obviously exists a multivariate normal (causally related) pair of random
vectors $\tilde{X}^n$, $\tilde{Z}^n$ with the same covariance structure. Thus, from (9.148),
we have

$$\frac{1}{2}\ln (2\pi e)^n |K_{X-Z}| = h(\tilde{X}^n - \tilde{Z}^n) \tag{9.149}$$

$$\ge h(\tilde{Z}^n) \tag{9.150}$$

$$= \frac{1}{2}\ln (2\pi e)^n |K_Z|, \tag{9.151}$$

thus proving (9.143). □

We are now in a position to prove that feedback increases the capacity 
of a nonwhite Gaussian additive noise channel by at most half a bit. 


Theorem 9.6.2

$$ C_{n,FB} \le C_n + \frac{1}{2} \quad \text{bits per transmission.} \tag{9.152} $$

Proof: Combining all the lemmas, we obtain

\begin{align}
C_{n,FB} &\le \max_{\mathrm{tr}(K_X) \le nP} \frac{1}{2n} \log \frac{|K_{X+Z}|}{|K_Z|} \tag{9.153} \\
&\le \max_{\mathrm{tr}(K_X) \le nP} \frac{1}{2n} \log \frac{2^n |K_X + K_Z|}{|K_Z|} \tag{9.154} \\
&= \max_{\mathrm{tr}(K_X) \le nP} \frac{1}{2n} \log \frac{|K_X + K_Z|}{|K_Z|} + \frac{1}{2} \tag{9.155} \\
&\le C_n + \frac{1}{2} \quad \text{bits per transmission,} \tag{9.156}
\end{align}

where the inequalities follow from Theorem 9.6.1, Lemma 9.6.3, and the
definition of capacity without feedback, respectively. □


We now prove Pinsker’s statement that feedback can at most double 
the capacity of colored noise channels. 

Theorem 9.6.3 $C_{n,FB} \le 2C_n$.

Proof: It is enough to show that

$$ \frac{1}{2} \cdot \frac{1}{2n} \log \frac{|K_{X+Z}|}{|K_Z|} \le \frac{1}{2n} \log \frac{|K_X + K_Z|}{|K_Z|}, \tag{9.157} $$

for it will then follow, by maximizing the right side and then the left
side, that

$$ \frac{1}{2} C_{n,FB} \le C_n. \tag{9.158} $$

We have

\begin{align}
\frac{1}{2n} \log \frac{|K_X + K_Z|}{|K_Z|} &\overset{(a)}{=} \frac{1}{2n} \log \frac{\left|\frac{1}{2}K_{X+Z} + \frac{1}{2}K_{X-Z}\right|}{|K_Z|} \tag{9.159} \\
&\overset{(b)}{\ge} \frac{1}{2n} \log \frac{|K_{X+Z}|^{1/2} |K_{X-Z}|^{1/2}}{|K_Z|} \tag{9.160} \\
&\overset{(c)}{\ge} \frac{1}{2n} \log \frac{|K_{X+Z}|^{1/2} |K_Z|^{1/2}}{|K_Z|} \tag{9.161} \\
&\overset{(d)}{=} \frac{1}{2} \cdot \frac{1}{2n} \log \frac{|K_{X+Z}|}{|K_Z|}, \tag{9.162}
\end{align}

and the result is proved. Here (a) follows from Lemma 9.6.1, (b) is the
inequality in Lemma 9.6.4, and (c) is Lemma 9.6.5 in which causality is
used. □

Thus, we have shown that Gaussian channel capacity is not increased
by more than half a bit or by more than a factor of 2 when we have
feedback; feedback helps, but not by much.


SUMMARY 

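Maximum entropy. $\max_{EX^2 = \alpha} h(X) = \frac{1}{2} \log 2\pi e \alpha$.

Gaussian channel. $Y_i = X_i + Z_i$; $Z_i \sim \mathcal{N}(0, N)$; power constraint
$\frac{1}{n} \sum_{i=1}^{n} x_i^2 \le P$; and

$$ C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \quad \text{bits per transmission.} \tag{9.163} $$

Bandlimited additive white Gaussian noise channel. Bandwidth $W$;
two-sided power spectral density $N_0/2$; signal power $P$; and

$$ C = W \log\left(1 + \frac{P}{N_0 W}\right) \quad \text{bits per second.} \tag{9.164} $$

Water-filling (k parallel Gaussian channels). $Y_j = X_j + Z_j$, $j = 1, 2, \ldots, k$;
$Z_j \sim \mathcal{N}(0, N_j)$; $\sum_{j=1}^{k} X_j^2 \le P$; and

$$ C = \sum_{i=1}^{k} \frac{1}{2} \log\left(1 + \frac{(\nu - N_i)^+}{N_i}\right), \tag{9.165} $$

where $\nu$ is chosen so that $\sum_i (\nu - N_i)^+ = P$.

Additive nonwhite Gaussian noise channel. $Y_i = X_i + Z_i$; $Z^n \sim \mathcal{N}(0, K_Z)$; and

$$ C = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \log\left(1 + \frac{(\nu - \lambda_i)^+}{\lambda_i}\right), \tag{9.166} $$

where $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $K_Z$ and $\nu$ is chosen so that
$\sum_i (\nu - \lambda_i)^+ = nP$.

Capacity without feedback

$$ C_n = \max_{\mathrm{tr}(K_X) \le nP} \frac{1}{2n} \log \frac{|K_X + K_Z|}{|K_Z|}. \tag{9.167} $$

Capacity with feedback

$$ C_{n,FB} = \max_{\mathrm{tr}(K_X) \le nP} \frac{1}{2n} \log \frac{|K_{X+Z}|}{|K_Z|}. \tag{9.168} $$

Feedback bounds

$$ C_{n,FB} \le C_n + \frac{1}{2}. \tag{9.169} $$

$$ C_{n,FB} \le 2C_n. \tag{9.170} $$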
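The water-filling entry in the summary can be made concrete with a small numerical sketch. It finds the water level $\nu$ by bisection and then evaluates the capacity sum for parallel channels. The function names `water_fill` and `capacity_bits`, the bisection method, and the example noise levels are illustrative assumptions, not part of the text.

```python
import math

def water_fill(noise, total_power, iters=200):
    """Find the water level nu with sum(max(nu - N_i, 0)) == total_power.

    Returns (nu, powers), where powers[i] = max(nu - noise[i], 0).
    Bisection is used here purely for simplicity.
    """
    lo, hi = min(noise), max(noise) + total_power
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        used = sum(max(nu - n, 0.0) for n in noise)
        if used > total_power:
            hi = nu   # water level too high
        else:
            lo = nu   # water level too low (or exactly right)
    nu = 0.5 * (lo + hi)
    return nu, [max(nu - n, 0.0) for n in noise]

def capacity_bits(noise, powers):
    """Sum of (1/2) log2(1 + P_i / N_i) over the parallel channels."""
    return sum(0.5 * math.log2(1.0 + p / n) for p, n in zip(powers, noise))

# Hypothetical example: three channels with noise variances 1, 2, 5 and P = 3.
noise_levels = [1.0, 2.0, 5.0]
level, allocation = water_fill(noise_levels, 3.0)
print(level, allocation, capacity_bits(noise_levels, allocation))
```

In this example the noisiest channel lies above the water level and receives no power, which is the characteristic behavior of the $(\nu - N_i)^+$ allocation.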


PROBLEMS 

9.1 Channel with two independent looks at Y. Let $Y_1$ and $Y_2$ be conditionally
independent and conditionally identically distributed
given $X$.

(a) Show that $I(X; Y_1, Y_2) = 2I(X; Y_1) - I(Y_1; Y_2)$.

(b) Conclude that the capacity of the channel

$$ X \longrightarrow (Y_1, Y_2) $$

is less than twice the capacity of the channel

$$ X \longrightarrow Y_1. $$

9.2 Two-look Gaussian channel. Consider the ordinary Gaussian channel with two correlated looks
at $X$, that is, $Y = (Y_1, Y_2)$, where

\begin{align}
Y_1 &= X + Z_1 \tag{9.171} \\
Y_2 &= X + Z_2 \tag{9.172}
\end{align}

with a power constraint $P$ on $X$, and $(Z_1, Z_2) \sim \mathcal{N}_2(0, K)$, where

$$ K = \begin{bmatrix} N & N\rho \\ N\rho & N \end{bmatrix}. \tag{9.173} $$

Find the capacity $C$ for

(a) $\rho = 1$

(b) $\rho = 0$

(c) $\rho = -1$

9.3 Output power constraint. Consider an additive white Gaussian
noise channel with an expected output power constraint $P$. Thus,
$Y = X + Z$, $Z \sim \mathcal{N}(0, \sigma^2)$, $Z$ is independent of $X$, and $EY^2 \le P$.
Find the channel capacity.

9.4 Exponential noise channels. $Y_i = X_i + Z_i$, where $Z_i$ is i.i.d.
exponentially distributed noise with mean $\mu$. Assume that we have
a mean constraint on the signal (i.e., $EX_i \le \lambda$). Show that the
capacity of such a channel is $C = \log\left(1 + \frac{\lambda}{\mu}\right)$.

9.5 Fading channel. Consider an additive noise fading channel

$$ Y = XV + Z, $$

where $Z$ is additive noise, $V$ is a random variable representing
fading, and $Z$ and $V$ are independent of each other and of $X$.
Argue that knowledge of the fading factor $V$ improves capacity by
showing that

$$ I(X; Y \mid V) \ge I(X; Y). $$


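9.6 Parallel channels and water-filling. Consider a pair of parallel
Gaussian channels:

$$ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} + \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}, \tag{9.174} $$

where

$$ \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}\right), \tag{9.175} $$

and there is a power constraint $E(X_1^2 + X_2^2) \le 2P$. Assume that
$\sigma_1^2 > \sigma_2^2$. At what power does the channel stop behaving like a
single channel with noise variance $\sigma_2^2$, and begin behaving like a
pair of channels?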

9.7 Multipath Gaussian channel. Consider a Gaussian noise channel 
with power constraint P, where the signal takes two different paths 
and the received noisy signals are added together at the antenna. 



(a) Find the capacity of this channel if $Z_1$ and $Z_2$ are jointly normal
with covariance matrix


(b) What is the capacity for $\rho = 0$, $\rho = 1$, $\rho = -1$?

9.8 Parallel Gaussian channels. Consider the following parallel
Gaussian channel:

[Figure: two parallel channels, $Y_1 = X_1 + Z_1$ with $Z_1 \sim \mathcal{N}(0, N_1)$ and $Y_2 = X_2 + Z_2$ with $Z_2 \sim \mathcal{N}(0, N_2)$.]

where $Z_1 \sim \mathcal{N}(0, N_1)$ and $Z_2 \sim \mathcal{N}(0, N_2)$ are independent Gaussian
random variables and $Y_i = X_i + Z_i$. We wish to allocate power
to the two parallel channels. Let $\beta_1$ and $\beta_2$ be fixed. Consider
a total cost constraint $\beta_1 P_1 + \beta_2 P_2 \le \beta$, where $P_i$ is the power
allocated to the $i$th channel and $\beta_i$ is the cost per unit power in
that channel. Thus, $P_1 \ge 0$ and $P_2 \ge 0$ can be chosen subject to
the cost constraint $\beta$.

(a) For what value of $\beta$ does the channel stop acting like a single
channel and start acting like a pair of channels?

(b) Evaluate the capacity and find $P_1$ and $P_2$ that achieve capacity
for $\beta_1 = 1$, $\beta_2 = 2$, $N_1 = 3$, $N_2 = 2$, and $\beta = 10$.


9.9 Vector Gaussian channel. Consider the vector Gaussian noise
channel

$$ Y = X + Z, $$

where $X = (X_1, X_2, X_3)$, $Z = (Z_1, Z_2, Z_3)$, $Y = (Y_1, Y_2, Y_3)$,
$E\|X\|^2 \le P$, and


z 〜川 0, 


Find the capacity. The answer may be surprising. 

9.10 Capacity of photographic film. Here is a problem with a nice
answer that takes a little time. We're interested in the capacity
of photographic film. The film consists of silver iodide crystals,
Poisson distributed, with a density of $\lambda$ particles per square inch.
The film is illuminated without knowledge of the position of the
silver iodide particles. It is then developed and the receiver sees
only the silver iodide particles that have been illuminated. It is
assumed that light incident on a cell exposes the grain if it is there
and otherwise results in a blank response. Silver iodide particles
that are not illuminated and vacant portions of the film remain
blank. The question is: What is the capacity of this film?

We make the following assumptions. We grid the film very finely
into cells of area $dA$. It is assumed that there is at most one silver
iodide particle per cell and that no silver iodide particle is
intersected by the cell boundaries. Thus, the film can be considered
to be a large number of parallel binary asymmetric channels
with crossover probability $1 - \lambda\, dA$. By calculating the capacity of
this binary asymmetric channel to first order in $dA$ (making the
necessary approximations), one can calculate the capacity of the
film in bits per square inch. It is, of course, proportional to $\lambda$. The
question is: What is the multiplicative constant?

The answer would be $\lambda$ bits per unit area if both illuminator and
receiver knew the positions of the crystals.

9.11 Gaussian mutual information. Suppose that $(X, Y, Z)$ are jointly
Gaussian and that $X \to Y \to Z$ forms a Markov chain. Let $X$ and
$Y$ have correlation coefficient $\rho_1$ and let $Y$ and $Z$ have correlation
coefficient $\rho_2$. Find $I(X; Z)$.

9.12 Time-varying channel. A train pulls out of the station at constant
velocity. The received signal energy thus falls off with time as
$1/i^2$. The total received signal at time $i$ is

$$ Y_i = \frac{1}{i} X_i + Z_i, $$

where $Z_1, Z_2, \ldots$ are i.i.d. $\sim \mathcal{N}(0, N)$. The transmitter constraint
for block length $n$ is

$$ \frac{1}{n} \sum_{i=1}^{n} x_i^2(w) \le P, \qquad w \in \{1, 2, \ldots, 2^{nR}\}. $$

Using Fano's inequality, show that the capacity $C$ is equal to zero
for this channel.

9.13 Feedback capacity. Let $(Z_1, Z_2) \sim \mathcal{N}(0, K)$, $K = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$.
Find the maximum of $\frac{1}{2} \log \frac{|K_{X+Z}|}{|K_Z|}$ with and without feedback given
a trace (power) constraint $\mathrm{tr}(K_X) \le 2P$.

9.14 Additive noise channel. Consider the channel $Y = X + Z$, where
$X$ is the transmitted signal with power constraint $P$, $Z$ is independent
additive noise, and $Y$ is the received signal. Let

with probability ^ 
with probability 各， 



where $Z^* \sim \mathcal{N}(0, N)$. Thus, $Z$ has a mixture distribution that is
the mixture of a Gaussian distribution and a degenerate distribution
with mass 1 at 0.

(a) What is the capacity of this channel? This should be a pleasant
surprise.

(b) How would you signal to achieve capacity?

9.15 Discrete input, continuous output channel. Let $\Pr\{X = 1\} = p$,
$\Pr\{X = 0\} = 1 - p$, and let $Y = X + Z$, where $Z$ is uniform over
the interval $[0, a]$, $a > 1$, and $Z$ is independent of $X$.

(a) Calculate

$$ I(X; Y) = H(X) - H(X \mid Y). $$

(b) Now calculate $I(X; Y)$ the other way by

$$ I(X; Y) = h(Y) - h(Y \mid X). $$

(c) Calculate the capacity of this channel by maximizing over $p$.

9.16 Gaussian mutual information. Suppose that $(X, Y, Z)$ are jointly
Gaussian and that $X \to Y \to Z$ forms a Markov chain. Let $X$ and
$Y$ have correlation coefficient $\rho_1$ and let $Y$ and $Z$ have correlation
coefficient $\rho_2$. Find $I(X; Z)$.

9.17 Impulse power. Consider the additive white Gaussian channel

[Figure: additive channel, $Y_i = X_i + Z_i$.]

where $Z_i \sim \mathcal{N}(0, N)$, and the input signal has average power
constraint $P$.

(a) Suppose that we use all our power at time 1 (i.e., $EX_1^2 = nP$
and $EX_i^2 = 0$ for $i = 2, 3, \ldots, n$). Find

$$ \max_{f(x^n)} \frac{I(X^n; Y^n)}{n}, $$

where the maximization is over all distributions $f(x^n)$ subject
to the constraint $EX_1^2 = nP$ and $EX_i^2 = 0$ for $i = 2, 3, \ldots, n$.

(b) Find

$$ \max_{f(x^n):\, E\left(\frac{1}{n} \sum_{i=1}^{n} X_i^2\right) \le P} \frac{1}{n} I(X^n; Y^n) $$

and compare to part (a).

9.18 Gaussian channel with time-varying mean. Find the capacity of 
the following Gaussian channel: 



Let $Z_1, Z_2, \ldots$ be independent and let there be a power constraint
$P$ on $x^n(W)$. Find the capacity when:

(a) $\mu_i = 0$, for all $i$.

(b) $\mu_i = e^i$, $i = 1, 2, \ldots$. Assume that $\mu_i$ is known to the
transmitter and receiver.

(c) $\mu_i$ unknown, but $\mu_i$ i.i.d. $\sim \mathcal{N}(0, N_1)$ for all $i$.


9.19 Parametric form for channel capacity. Consider $m$ parallel Gaussian
channels, $Y_i = X_i + Z_i$, where $Z_i \sim \mathcal{N}(0, \lambda_i)$ and the noises
$Z_i$ are independent random variables. Thus,
$C = \sum_{i=1}^{m} \frac{1}{2} \log\left(1 + \frac{(\lambda - \lambda_i)^+}{\lambda_i}\right)$,
where $\lambda$ is chosen to satisfy $\sum_{i=1}^{m} (\lambda - \lambda_i)^+ = P$. Show
that this can be rewritten in the form

$$ P(\lambda) = \sum_{i:\, \lambda_i \le \lambda} (\lambda - \lambda_i), $$

$$ C(\lambda) = \sum_{i:\, \lambda_i \le \lambda} \frac{1}{2} \log \frac{\lambda}{\lambda_i}. $$

Here $P(\lambda)$ is piecewise linear and $C(\lambda)$ is piecewise logarithmic
in $\lambda$.

9.20 Robust decoding. Consider an additive noise channel whose output
$Y$ is given by

$$ Y = X + Z, $$

where the channel input $X$ is average power limited,

$$ EX^2 \le P, $$

and the noise process $\{Z_k\}_{k=-\infty}^{\infty}$ is i.i.d. with marginal distribution
$p_Z(z)$ (not necessarily Gaussian) of power $N$,

$$ EZ^2 = N. $$

(a) Show that the channel capacity, $C = \max_{EX^2 \le P} I(X; Y)$, is
lower bounded by $C_G$, where

$$ C_G = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) $$

(i.e., the capacity $C_G$ corresponding to white Gaussian noise).

(b) Decoding the received vector to the codeword that is closest to
it in Euclidean distance is in general suboptimal if the noise is
non-Gaussian. Show, however, that the rate $C_G$ is achievable
even if one insists on performing nearest-neighbor decoding
(minimum Euclidean distance decoding) rather than the optimal
maximum-likelihood or joint typicality decoding (with respect
to the true noise distribution).

(c) Extend the result to the case where the noise is not i.i.d. but is
stationary and ergodic with power $N$.

(Hint for b and c: Consider a size $2^{nR}$ random codebook whose
codewords are drawn independently of each other according to a
uniform distribution over the $n$-dimensional sphere of radius $\sqrt{nP}$.)

(a) Using a symmetry argument, show that conditioned on the
noise vector, the ensemble average probability of error depends
on the noise vector only via its Euclidean norm $\|z\|$.

(b) Use a geometric argument to show that this dependence is
monotonic.

(c) Given a rate $R < C_G$, choose some $N' > N$ such that

$$ R < \frac{1}{2} \log\left(1 + \frac{P}{N'}\right). $$

Compare the case where the noise is i.i.d. $\mathcal{N}(0, N')$ to the case
at hand.

(d) Conclude the proof using the fact that the above ensemble of
codebooks can achieve the capacity of the Gaussian channel
(no need to prove that).

9.21 Mutual information game. Consider the following channel:

[Figure: additive channel, $Y = X + Z$.]

Throughout this problem we shall constrain the signal power

$$ EX = 0, \quad EX^2 = P, \tag{9.176} $$

and the noise power

$$ EZ = 0, \quad EZ^2 = N, \tag{9.177} $$

and assume that $X$ and $Z$ are independent. The channel capacity is
given by $I(X; X + Z)$.

Now for the game. The noise player chooses a distribution on $Z$ to
minimize $I(X; X + Z)$, while the signal player chooses a distribution
on $X$ to maximize $I(X; X + Z)$. Letting $X^* \sim \mathcal{N}(0, P)$, $Z^* \sim \mathcal{N}(0, N)$,
show that Gaussian $X^*$ and $Z^*$ satisfy the saddlepoint
conditions

$$ I(X; X + Z^*) \le I(X^*; X^* + Z^*) \le I(X^*; X^* + Z). \tag{9.178} $$

Thus,

\begin{align}
\min_{Z} \max_{X} I(X; X + Z) &= \max_{X} \min_{Z} I(X; X + Z) \tag{9.179} \\
&= \frac{1}{2} \log\left(1 + \frac{P}{N}\right), \tag{9.180}
\end{align}

and the game has a value. In particular, a deviation from normal
for either player worsens the mutual information from that player's
standpoint. Can you discuss the implications of this?

Note: Part of the proof hinges on the entropy power inequality from
Section 17.8, which states that if $X$ and $Y$ are independent random
$n$-vectors with densities, then

$$ 2^{\frac{2}{n} h(X+Y)} \ge 2^{\frac{2}{n} h(X)} + 2^{\frac{2}{n} h(Y)}. \tag{9.181} $$

9.22 Recovering the noise. Consider a standard Gaussian channel $Y^n = X^n + Z^n$,
where $Z_i$ is i.i.d. $\sim \mathcal{N}(0, N)$, $i = 1, 2, \ldots, n$, and
$\frac{1}{n} \sum_{i=1}^{n} X_i^2 \le P$. Here we are interested in recovering the noise $Z^n$
and we don't care about the signal $X^n$. By sending $X^n = (0, 0, \ldots, 0)$,
the receiver gets $Y^n = Z^n$ and can fully determine the value of
$Z^n$. We wonder how much variability there can be in $X^n$ and still
recover the Gaussian noise $Z^n$. Use of the channel looks like

[Figure: $X^n$ enters the channel with additive noise $Z^n$; the receiver forms the estimate $\hat{Z}^n(Y^n)$ from $Y^n$.]

Argue that for some $R > 0$, the transmitter can arbitrarily send one
of $2^{nR}$ different sequences of $x^n$ without affecting the recovery of
the noise in the sense that

$$ \Pr\{\hat{Z}^n \ne Z^n\} \to 0 \quad \text{as } n \to \infty. $$

For what $R$ is this possible?


HISTORICAL NOTES 

The Gaussian channel was first analyzed by Shannon in his original 
paper [472]. The water-filling solution to the capacity of the colored 
noise Gaussian channel was developed by Shannon [480] and treated in 
detail by Pinsker [425]. The time-continuous Gaussian channel is treated 
in Wyner [576], Gallager [233], and Landau, Pollak, and Slepian [340,
341, 500]. 

Pinsker [421] and Ebert [178] argued that feedback at most doubles 
the capacity of a nonwhite Gaussian channel; the proof in the text is 
from Cover and Pombra [136], who also show that feedback increases 
the capacity of the nonwhite Gaussian channel by at most half a bit. 
The most recent feedback capacity results for nonwhite Gaussian noise 
channels are due to Kim [314]. 







CHAPTER 10 


RATE DISTORTION THEORY 


The description of an arbitrary real number requires an infinite number 
of bits, so a finite representation of a continuous random variable can 
never be perfect. How well can we do? To frame the question appropriately,
it is necessary to define the “goodness” of a representation of a
source. This is accomplished by defining a distortion measure which is a 
measure of distance between the random variable and its representation. 
The basic problem in rate distortion theory can then be stated as follows: 
Given a source distribution and a distortion measure, what is the minimum 
expected distortion achievable at a particular rate? Or, equivalently, what 
is the minimum rate description required to achieve a particular distortion? 

One of the most intriguing aspects of this theory is that joint descriptions 
are more efficient than individual descriptions. It is simpler to describe an 
elephant and a chicken with one description than to describe each alone. This 
is true even for independent random variables. It is simpler to describe $X_1$
and $X_2$ together (at a given distortion for each) than to describe each by itself.
Why don’t independent problems have independent solutions? The answer 
is found in the geometry. Apparently, rectangular grid points (arising from 
independent descriptions) do not fill up the space efficiently. 

Rate distortion theory can be applied to both discrete and continuous 
random variables. The zero-error data compression theory of Chapter 5 
is an important special case of rate distortion theory applied to a discrete 
source with zero distortion. We begin by considering the simple problem 
of representing a single continuous random variable by a finite number 
of bits. 


10.1 QUANTIZATION 

In this section we motivate the elegant theory of rate distortion by showing
how complicated it is to solve the quantization problem exactly for a single
random variable. Since a continuous random source requires infinite precision
to represent exactly, we cannot reproduce it exactly using a finite-rate
code. The question is then to find the best possible representation for any
given data rate.

We first consider the problem of representing a single sample from the
source. Let the random variable to be represented be $X$ and let the
representation of $X$ be denoted as $\hat{X}(X)$. If we are given $R$ bits to represent $X$,
the function $\hat{X}$ can take on $2^R$ values. The problem is to find the optimum
set of values for $\hat{X}$ (called the reproduction points or code points) and
the regions that are associated with each value $\hat{X}$.

For example, let $X \sim \mathcal{N}(0, \sigma^2)$, and assume a squared-error distortion
measure. In this case we wish to find the function $\hat{X}(X)$ such that $\hat{X}$ takes
on at most $2^R$ values and minimizes $E(X - \hat{X}(X))^2$. If we are given one
bit to represent $X$, it is clear that the bit should distinguish whether or
not $X > 0$. To minimize squared error, each reproduced symbol should
be the conditional mean of its region. This is illustrated in Figure 10.1.
Thus,

$$ \hat{X}(x) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\, \sigma & \text{if } x \ge 0, \\[2ex] -\sqrt{\dfrac{2}{\pi}}\, \sigma & \text{if } x < 0. \end{cases} \tag{10.1} $$
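The one-bit reproduction point can be checked numerically. The sketch below estimates the conditional mean $E[X \mid X \ge 0]$ and the resulting mean-squared error by midpoint-rule integration, and compares them with the standard closed forms $\sigma\sqrt{2/\pi}$ and $\sigma^2(1 - 2/\pi)$. The function names, the step count, and the truncation at ten standard deviations are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def one_bit_quantizer(sigma, steps=20_000, span=10.0):
    """Midpoint-rule estimate of E[X | X >= 0] and of the one-bit MSE.

    The integral over [0, infinity) is truncated at span * sigma.
    """
    dx = span * sigma / steps
    xs = [(k + 0.5) * dx for k in range(steps)]
    mass = sum(gaussian_pdf(x, sigma) for x in xs) * dx               # Pr(X >= 0)
    point = sum(x * gaussian_pdf(x, sigma) for x in xs) * dx / mass   # E[X | X >= 0]
    # By symmetry the negative half contributes the same squared error.
    mse = 2.0 * sum((x - point) ** 2 * gaussian_pdf(x, sigma) for x in xs) * dx
    return point, mse

sigma = 2.0
point, mse = one_bit_quantizer(sigma)
print(point, sigma * math.sqrt(2.0 / math.pi))
print(mse, sigma ** 2 * (1.0 - 2.0 / math.pi))
```

The two printed pairs should agree to several decimal places, confirming that the reproduction points in (10.1) are the conditional means of the two half-lines.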


If we are given 2 bits to represent the sample, the situation is not as
simple. Clearly, we want to divide the real line into four regions and use
a point within each region to represent the sample. But it is no longer
immediately obvious what the representation regions and the reconstruction
points should be. We can, however, state two simple properties of
optimal regions and reconstruction points for the quantization of a single
random variable:

FIGURE 10.1. One-bit quantization of Gaussian random variable.

• Given a set $\{\hat{X}(w)\}$ of reconstruction points, the distortion is minimized
by mapping a source random variable $X$ to the representation
$\hat{X}(w)$ that is closest to it. The set of regions of $\mathcal{X}$ defined by this
mapping is called a Voronoi or Dirichlet partition defined by the
reconstruction points.

• The reconstruction points should minimize the conditional expected
distortion over their respective assignment regions.

These two properties enable us to construct a simple algorithm to find a
“good” quantizer: We start with a set of reconstruction points, find the optimal
set of reconstruction regions (which are the nearest-neighbor regions
with respect to the distortion measure), then find the optimal reconstruction
points for these regions (the centroids of these regions if the distortion
is squared error), and then repeat the iteration for this new set of
reconstruction points. The expected distortion is decreased at each stage in the
algorithm, so the algorithm will converge to a local minimum of the
distortion. This algorithm is called the Lloyd algorithm [363] (for real-valued
random variables) or the generalized Lloyd algorithm [358] (for vector-valued
random variables) and is frequently used to design quantization
systems.
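The iteration described above can be sketched for scalar samples under squared-error distortion. This is a minimal illustration that runs on an empirical sample rather than on the true density; the names `lloyd` and `mse`, the fixed iteration count, and the rule of keeping a point unchanged when its cell is empty are assumptions of this sketch, not details taken from the text.

```python
import random

def lloyd(samples, points, iters=30):
    """Alternate nearest-neighbor assignment and centroid updates (1-D, squared error)."""
    points = sorted(points)
    for _ in range(iters):
        # Step 1: assign each sample to its closest reconstruction point.
        cells = [[] for _ in points]
        for x in samples:
            j = min(range(len(points)), key=lambda k: (x - points[k]) ** 2)
            cells[j].append(x)
        # Step 2: move each point to the centroid of its cell
        # (an empty cell keeps its previous point).
        points = [sum(c) / len(c) if c else p for c, p in zip(cells, points)]
    return points

def mse(samples, points):
    """Average squared error when each sample is mapped to its nearest point."""
    return sum(min((x - p) ** 2 for p in points) for x in samples) / len(samples)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
start = [-2.0, -0.5, 0.5, 2.0]          # hypothetical 2-bit starting points
final = lloyd(data, start)
print(final, mse(data, start), mse(data, final))
```

On the empirical sample, neither step can increase the average distortion, so the final error is never larger than the starting error. The result is still only a local minimum and can depend on the starting points.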

Instead of quantizing a single random variable, let us assume that we 
are given a set of n i.i.d. random variables drawn according to a Gaussian 
distribution. These random variables are to be represented using nR bits. 
Since the source is i.i.d., the symbols are independent, and it may appear 
that the representation of each element is an independent problem to be 
treated separately. But this is not true, as the results on rate distortion 
theory will show. We will represent the entire sequence by a single index
taking $2^{nR}$ values. This treatment of entire sequences at once achieves a
lower distortion for the same rate than independent quantization of the 
individual samples. 


10.2 DEFINITIONS 

Assume that we have a source that produces a sequence $X_1, X_2, \ldots, X_n$
i.i.d. $\sim p(x)$, $x \in \mathcal{X}$. For the proofs in this chapter, we assume that the
alphabet is finite, but most of the proofs can be extended to continuous
random variables. The encoder describes the source sequence $X^n$ by an
index $f_n(X^n) \in \{1, 2, \ldots, 2^{nR}\}$. The decoder represents $X^n$ by an estimate
$\hat{X}^n \in \hat{\mathcal{X}}^n$, as illustrated in Figure 10.2.

[Figure: $X^n$ → Encoder → $f_n(X^n) \in \{1, 2, \ldots, 2^{nR}\}$ → Decoder → $\hat{X}^n$.]

FIGURE 10.2. Rate distortion encoder and decoder.

Definition A distortion function or distortion measure is a mapping

$$ d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathcal{R}^+ \tag{10.2} $$

from the set of source alphabet-reproduction alphabet pairs into the set of
nonnegative real numbers. The distortion $d(x, \hat{x})$ is a measure of the cost
of representing the symbol $x$ by the symbol $\hat{x}$.

Definition A distortion measure is said to be bounded if the maximum
value of the distortion is finite:

$$ d_{\max} \overset{\text{def}}{=} \max_{x \in \mathcal{X},\, \hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) < \infty. \tag{10.3} $$

In most cases, the reproduction alphabet $\hat{\mathcal{X}}$ is the same as the source
alphabet $\mathcal{X}$.

Examples of common distortion functions are

• Hamming (probability of error) distortion. The Hamming distortion
is given by

$$ d(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x}, \\ 1 & \text{if } x \ne \hat{x}, \end{cases} \tag{10.4} $$

which results in a probability of error distortion, since $Ed(X, \hat{X}) = \Pr(X \ne \hat{X})$.

• Squared-error distortion. The squared-error distortion,

$$ d(x, \hat{x}) = (x - \hat{x})^2, \tag{10.5} $$


is the most popular distortion measure used for continuous alphabets. 
Its advantages are its simplicity and its relationship to least-squares 
prediction. But in applications such as image and speech coding, 
various authors have pointed out that the mean-squared error is not an 
appropriate measure of distortion for human observers. For example, 
there is a large squared-error distortion between a speech waveform 
and another version of the same waveform slightly shifted in time, 
even though both would sound the same to a human observer. 

Many alternatives have been proposed; a popular measure of distortion 
in speech coding is the Itakura-Saito distance, which is the relative entropy
between multivariate normal processes. In image coding, however, there is 
at present no real alternative to using the mean-squared error as the distortion 
measure. 

The distortion measure is defined on a symbol-by-symbol basis. We 
extend the definition to sequences by using the following definition: 

Definition The distortion between sequences $x^n$ and $\hat{x}^n$ is defined by

$$ d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i). \tag{10.6} $$

So the distortion for a sequence is the average of the per symbol distortion
of the elements of the sequence. This is not the only reasonable
definition. For example, one may want to measure the distortion between
two sequences by the maximum of the per symbol distortions. The theory
derived below does not apply directly to this more general distortion
measure.
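The per-symbol distortion measures and the sequence distortion (10.6) translate directly into code. The following is a minimal sketch; the function names `hamming`, `squared_error`, and `sequence_distortion` are chosen here for illustration.

```python
def hamming(x, x_hat):
    """Hamming distortion (10.4): 0 if the symbols agree, 1 otherwise."""
    return 0 if x == x_hat else 1

def squared_error(x, x_hat):
    """Squared-error distortion (10.5)."""
    return (x - x_hat) ** 2

def sequence_distortion(xs, x_hats, d):
    """Average per-symbol distortion between two sequences, as in (10.6)."""
    if len(xs) != len(x_hats) or not xs:
        raise ValueError("sequences must be nonempty and of equal length")
    return sum(d(a, b) for a, b in zip(xs, x_hats)) / len(xs)

# One mismatch in four binary symbols.
print(sequence_distortion([0, 1, 1, 0], [0, 1, 0, 0], hamming))
# Squared error averaged over two real-valued symbols.
print(sequence_distortion([1.0, 2.0], [1.5, 2.0], squared_error))
```

Passing the per-symbol measure `d` as an argument mirrors the way the definition extends any symbol-by-symbol distortion to sequences by averaging.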

Definition A $(2^{nR}, n)$-rate distortion code consists of an encoding function,

$$ f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}, \tag{10.7} $$

and a decoding (reproduction) function,

$$ g_n : \{1, 2, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n. \tag{10.8} $$

The distortion associated with the $(2^{nR}, n)$ code is defined as

$$ D = Ed(X^n, g_n(f_n(X^n))), \tag{10.9} $$

where the expectation is with respect to the probability distribution on $X$:

$$ D = \sum_{x^n} p(x^n)\, d(x^n, g_n(f_n(x^n))). \tag{10.10} $$

The set of $n$-tuples $g_n(1), g_n(2), \ldots, g_n(2^{nR})$, denoted by $\hat{X}^n(1), \ldots, \hat{X}^n(2^{nR})$,
constitutes the codebook, and $f_n^{-1}(1), \ldots, f_n^{-1}(2^{nR})$ are the
associated assignment regions.

Many terms are used to describe the replacement of $X^n$ by its quantized
version $\hat{X}^n(w)$. It is common to refer to $\hat{X}^n$ as the vector quantization,
reproduction, reconstruction, representation, source code, or estimate
of $X^n$.

Definition A rate distortion pair $(R, D)$ is said to be achievable if
there exists a sequence of $(2^{nR}, n)$-rate distortion codes $(f_n, g_n)$ with
$\lim_{n \to \infty} Ed(X^n, g_n(f_n(X^n))) \le D$.

Definition The rate distortion region for a source is the closure of the
set of achievable rate distortion pairs $(R, D)$.

Definition The rate distortion function $R(D)$ is the infimum of rates $R$
such that $(R, D)$ is in the rate distortion region of the source for a given
distortion $D$.

Definition The distortion rate function $D(R)$ is the infimum of all distortions
$D$ such that $(R, D)$ is in the rate distortion region of the source
for a given rate $R$.

The distortion rate function defines another way of looking at the 
boundary of the rate distortion region. We will in general use the rate 
distortion function rather than the distortion rate function to describe this 
boundary, although the two approaches are equivalent. 

We now define a mathematical function of the source, which we call 
the information rate distortion function. The main result of this chapter 
is the proof that the information rate distortion function is equal to the 
rate distortion function defined above (i.e., it is the infimum of rates that 
achieve a particular distortion). 




Definition The information rate distortion function $R^{(I)}(D)$ for a source
$X$ with distortion measure $d(x, \hat{x})$ is defined as

$$ R^{(I)}(D) = \min_{p(\hat{x} \mid x):\ \sum_{(x, \hat{x})} p(x) p(\hat{x} \mid x) d(x, \hat{x}) \le D} I(X; \hat{X}), \tag{10.11} $$

where the minimization is over all conditional distributions $p(\hat{x} \mid x)$ for
which the joint distribution $p(x, \hat{x}) = p(x) p(\hat{x} \mid x)$ satisfies the expected
distortion constraint.

Paralleling the discussion of channel capacity in Chapter 7, we initially
consider the properties of the information rate distortion function and
calculate it for some simple sources and distortion measures. Later we
prove that we can actually achieve this function (i.e., there exist codes with
rate $R^{(I)}(D)$ with distortion $D$). We also prove a converse establishing
that $R \ge R^{(I)}(D)$ for any code that achieves distortion $D$.

The main theorem of rate distortion theory can now be stated as follows:

Theorem 10.2.1 The rate distortion function for an i.i.d. source $X$
with distribution $p(x)$ and bounded distortion function $d(x, \hat{x})$ is equal to
the associated information rate distortion function. Thus,

$$ R(D) = R^{(I)}(D) = \min_{p(\hat{x} \mid x):\ \sum_{(x, \hat{x})} p(x) p(\hat{x} \mid x) d(x, \hat{x}) \le D} I(X; \hat{X}) \tag{10.12} $$

is the minimum achievable rate at distortion $D$.

This theorem shows that the operational definition of the rate distortion 
function is equal to the information definition. Hence we will use R(D) 
from now on to denote both definitions of the rate distortion function. 
Before coming to the proof of the theorem, we calculate the information 
rate distortion function for some simple sources and distortions. 


10.3 CALCULATION OF THE RATE DISTORTION FUNCTION 
10.3.1 Binary Source 

We now find the description rate R(D) required to describe a Bernoulli(p) 
source with an expected proportion of errors less than or equal to D. 


Theorem 10.3.1 The rate distortion function for a Bernoulli($p$) source
with Hamming distortion is given by

$$ R(D) = \begin{cases} H(p) - H(D), & 0 \le D \le \min\{p, 1-p\}, \\ 0, & D > \min\{p, 1-p\}. \end{cases} \tag{10.13} $$

Proof: Consider a binary source $X \sim$ Bernoulli($p$) with a Hamming
distortion measure. Without loss of generality, we may assume that $p < \frac{1}{2}$.
We wish to calculate the rate distortion function,

$$ R(D) = \min_{p(\hat{x} \mid x):\ \sum_{(x, \hat{x})} p(x) p(\hat{x} \mid x) d(x, \hat{x}) \le D} I(X; \hat{X}). \tag{10.14} $$

Let $\oplus$ denote modulo 2 addition. Thus, $X \oplus \hat{X} = 1$ is equivalent to $X \ne \hat{X}$.
We do not minimize $I(X; \hat{X})$ directly; instead, we find a lower bound
and then show that this lower bound is achievable. For any joint distribution
satisfying the distortion constraint, we have

\begin{align}
I(X; \hat{X}) &= H(X) - H(X \mid \hat{X}) \tag{10.15} \\
&= H(p) - H(X \oplus \hat{X} \mid \hat{X}) \tag{10.16} \\
&\ge H(p) - H(X \oplus \hat{X}) \tag{10.17} \\
&\ge H(p) - H(D), \tag{10.18}
\end{align}

since $\Pr(X \ne \hat{X}) \le D$ and $H(D)$ increases with $D$ for $D \le \frac{1}{2}$. Thus,

$$ R(D) \ge H(p) - H(D). \tag{10.19} $$


We now show that the lower bound is actually the rate distortion function
by finding a joint distribution that meets the distortion constraint and
has $I(X; \hat{X}) = R(D)$. For $0 \le D \le p$, we can achieve the value of the
rate distortion function in (10.19) by choosing $(X, \hat{X})$ to have the joint
distribution given by the binary symmetric channel shown in Figure 10.3.

[Figure: binary symmetric test channel with crossover probability $D$; input $\hat{X}$ with $\Pr(\hat{X} = 1) = \frac{p - D}{1 - 2D}$, output $X$ with $\Pr(X = 1) = p$.]

FIGURE 10.3. Joint distribution for binary source.

We choose the distribution of $\hat{X}$ at the input of the channel so that the
output distribution of $X$ is the specified distribution. Let $r = \Pr(\hat{X} = 1)$.
Then choose $r$ so that

$$ r(1 - D) + (1 - r)D = p, \tag{10.20} $$

or

$$ r = \frac{p - D}{1 - 2D}. \tag{10.21} $$

If $D \le p \le \frac{1}{2}$, then $\Pr(\hat{X} = 1) \ge 0$ and $\Pr(\hat{X} = 0) \ge 0$. We then have

$$ I(X; \hat{X}) = H(X) - H(X \mid \hat{X}) = H(p) - H(D), \tag{10.22} $$

and the expected distortion is $\Pr(X \ne \hat{X}) = D$.

If $D \ge p$, we can achieve $R(D) = 0$ by letting $\hat{X} = 0$ with probability
1. In this case, $I(X; \hat{X}) = 0$ and $D = p$. Similarly, if $D \ge 1 - p$, we can
achieve $R(D) = 0$ by setting $\hat{X} = 1$ with probability 1. Hence, the rate
distortion function for a binary source is

$$ R(D) = \begin{cases} H(p) - H(D), & 0 \le D \le \min\{p, 1-p\}, \\ 0, & D > \min\{p, 1-p\}. \end{cases} \tag{10.23} $$

This function is illustrated in Figure 10.4. □

FIGURE 10.4. Rate distortion function for a Bernoulli($\frac{1}{2}$) source.

The above calculations may seem entirely unmotivated. Why should
minimizing mutual information have anything to do with quantization?
The answer to this question must wait until we prove Theorem 10.2.1.

10.3.2 Gaussian Source 

Although Theorem 10.2.1 is proved only for discrete sources with a 
bounded distortion measure, it can also be proved for well-behaved contin¬ 
uous sources and unbounded distortion measures. Assuming this general 
theorem, we calculate the rate distortion function for a Gaussian source 
with squared-error distortion. 

Theorem 10.3.2 The rate distortion function for a a 2 ) source with 

squared-error distortion is 


Proof: Let $X$ be $\sim \mathcal{N}(0, \sigma^2)$. By the rate distortion theorem extended
to continuous alphabets, we have

\[ R(D) = \min_{f(\hat{x}|x):\, E(\hat{X} - X)^2 \le D} I(X; \hat{X}). \tag{10.25} \]

As in the preceding example, we first find a lower bound for the rate
distortion function and then prove that this is achievable. Since
$E(X - \hat{X})^2 \le D$, we observe that

\begin{align}
I(X; \hat{X}) &= h(X) - h(X \mid \hat{X}) \tag{10.26} \\
&= \tfrac{1}{2} \log(2\pi e)\sigma^2 - h(X - \hat{X} \mid \hat{X}) \tag{10.27} \\
&\ge \tfrac{1}{2} \log(2\pi e)\sigma^2 - h(X - \hat{X}) \tag{10.28} \\
&\ge \tfrac{1}{2} \log(2\pi e)\sigma^2 - h(\mathcal{N}(0, E(X - \hat{X})^2)) \tag{10.29} \\
&= \tfrac{1}{2} \log(2\pi e)\sigma^2 - \tfrac{1}{2} \log(2\pi e) E(X - \hat{X})^2 \tag{10.30} \\
&\ge \tfrac{1}{2} \log(2\pi e)\sigma^2 - \tfrac{1}{2} \log(2\pi e) D \tag{10.31} \\
&= \tfrac{1}{2} \log \frac{\sigma^2}{D}, \tag{10.32}
\end{align}

where (10.28) follows from the fact that conditioning reduces entropy and
(10.29) follows from the fact that the normal distribution maximizes the
entropy for a given second moment (Theorem 8.6.5). Hence,

\[ R(D) \ge \frac{1}{2} \log \frac{\sigma^2}{D}. \tag{10.33} \]

To find the conditional density $f(\hat{x}|x)$ that achieves this lower bound,
it is usually more convenient to look at the conditional density $f(x|\hat{x})$,
which is sometimes called the test channel (thus emphasizing the duality of
rate distortion with channel capacity). As in the binary case, we construct
$f(x|\hat{x})$ to achieve equality in the bound. We choose the joint distribution
as shown in Figure 10.5. If $D \le \sigma^2$, we choose

\[ X = \hat{X} + Z, \quad \hat{X} \sim \mathcal{N}(0, \sigma^2 - D), \quad Z \sim \mathcal{N}(0, D), \tag{10.34} \]

where $\hat{X}$ and $Z$ are independent. For this joint distribution, we calculate

\[ I(X; \hat{X}) = \frac{1}{2} \log \frac{\sigma^2}{D}, \tag{10.35} \]

and $E(X - \hat{X})^2 = D$, thus achieving the bound in (10.33). If $D > \sigma^2$, we
choose $\hat{X} = 0$ with probability 1, achieving $R(D) = 0$. Hence, the rate
distortion function for the Gaussian source with squared-error distortion is

\[ R(D) = \begin{cases} \frac{1}{2} \log \frac{\sigma^2}{D}, & 0 \le D \le \sigma^2, \\ 0, & D > \sigma^2, \end{cases} \tag{10.36} \]

as illustrated in Figure 10.6. □

We can rewrite (10.36) to express the distortion in terms of the rate,

\[ D(R) = \sigma^2 2^{-2R}. \tag{10.37} \]

FIGURE 10.5. Joint distribution for Gaussian source.

FIGURE 10.6. Rate distortion function for a Gaussian source.


Each bit of description reduces the expected distortion by a factor of 4.
With a 1-bit description, the best expected square error is $\sigma^2/4$. We can
compare this with the result of simple 1-bit quantization of a $\mathcal{N}(0, \sigma^2)$
random variable as described in Section 10.1. In this case, using the two
regions corresponding to the positive and negative real lines and
reproduction points as the centroids of the respective regions, the expected
distortion is $\frac{\pi - 2}{\pi}\sigma^2 = 0.3633\sigma^2$ (see Problem 10.1). As we prove later, the
rate distortion limit $R(D)$ is achieved by considering long block lengths.
This example shows that we can achieve a lower distortion by considering
several distortion problems in succession (long block lengths) than can
be achieved by considering each problem separately. This is somewhat
surprising because we are quantizing independent random variables.
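The comparison above can be reproduced numerically. The sketch below evaluates (10.36), (10.37), and the closed-form distortion of the sign quantizer with centroid reproduction points; the function names are illustrative and not taken from the text.

```python
import math


def gaussian_rate(sigma2: float, D: float) -> float:
    """R(D) in bits for an N(0, sigma2) source with squared-error distortion (10.36)."""
    if D <= 0.0:
        raise ValueError("D must be positive")
    return 0.5 * math.log2(sigma2 / D) if D < sigma2 else 0.0


def gaussian_distortion(sigma2: float, R: float) -> float:
    """D(R) = sigma2 * 2^(-2R), the inverse relation (10.37)."""
    return sigma2 * 2.0 ** (-2.0 * R)


def one_bit_quantizer_distortion(sigma2: float) -> float:
    """Expected squared error of the 1-bit sign quantizer whose reproduction
    points are the centroids +/- sigma * sqrt(2/pi): sigma2 * (pi - 2) / pi."""
    return sigma2 * (math.pi - 2.0) / math.pi
```

With $\sigma^2 = 1$ and $R = 1$ bit, `gaussian_distortion` returns 0.25, while `one_bit_quantizer_distortion` returns roughly 0.363, which is the gap between scalar quantization and the block-coding limit discussed above.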


10.3.3 Simultaneous Description of Independent Gaussian
Random Variables


Consider the case of representing $m$ independent (but not identically
distributed) normal random sources $X_1, \ldots, X_m$, where $X_i$ are $\sim \mathcal{N}(0, \sigma_i^2)$,
with squared-error distortion. Assume that we are given $R$ bits with which
to represent this random vector. The question naturally arises as to how
we should allot these bits to the various components to minimize the
total distortion. Extending the definition of the information rate distortion
function to the vector case, we have

\[ R(D) = \min_{f(\hat{x}^m|x^m):\, E d(X^m, \hat{X}^m) \le D} I(X^m; \hat{X}^m), \tag{10.38} \]

where $d(x^m, \hat{x}^m) = \sum_{i=1}^m (x_i - \hat{x}_i)^2$. Now using the arguments in the
preceding example, we have

\begin{align}
I(X^m; \hat{X}^m) &= h(X^m) - h(X^m \mid \hat{X}^m) \tag{10.39} \\
&= \sum_{i=1}^m h(X_i) - \sum_{i=1}^m h(X_i \mid X^{i-1}, \hat{X}^m) \tag{10.40} \\
&\ge \sum_{i=1}^m h(X_i) - \sum_{i=1}^m h(X_i \mid \hat{X}_i) \tag{10.41} \\
&= \sum_{i=1}^m I(X_i; \hat{X}_i) \tag{10.42} \\
&\ge \sum_{i=1}^m R(D_i) \tag{10.43} \\
&= \sum_{i=1}^m \left( \frac{1}{2} \log \frac{\sigma_i^2}{D_i} \right)^+, \tag{10.44}
\end{align}

where $D_i = E(X_i - \hat{X}_i)^2$ and (10.41) follows from the fact that
conditioning reduces entropy. We can achieve equality in (10.41) by choosing
$f(x^m|\hat{x}^m) = \prod_{i=1}^m f(x_i|\hat{x}_i)$ and in (10.43) by choosing the distribution of
each $\hat{X}_i \sim \mathcal{N}(0, \sigma_i^2 - D_i)$, as in the preceding example. Hence, the
problem of finding the rate distortion function can be reduced to the following
optimization (using nats for convenience):

\[ R(D) = \min_{\sum D_i = D} \sum_{i=1}^m \max\left\{ \frac{1}{2} \ln \frac{\sigma_i^2}{D_i},\, 0 \right\}. \tag{10.45} \]

Using Lagrange multipliers, we construct the functional

\[ J(D) = \sum_{i=1}^m \frac{1}{2} \ln \frac{\sigma_i^2}{D_i} + \lambda \sum_{i=1}^m D_i, \tag{10.46} \]

and differentiating with respect to $D_i$ and setting equal to 0, we have

\[ \frac{\partial J}{\partial D_i} = -\frac{1}{2} \frac{1}{D_i} + \lambda = 0 \tag{10.47} \]

or, writing $\lambda' = 1/(2\lambda)$,

\[ D_i = \lambda'. \tag{10.48} \]

Hence, the optimum allotment of the bits to the various descriptions
results in an equal distortion for each random variable. This is possible if
the constant $\lambda'$ in (10.48) is less than $\sigma_i^2$ for all $i$. As the total allowable
distortion $D$ is increased, the constant $\lambda'$ increases until it exceeds $\sigma_i^2$
for some $i$. At this point the solution (10.48) is on the boundary of the
allowable region of distortions. If we increase the total distortion, we must
use the Kuhn-Tucker conditions to find the minimum in (10.46). In this
case the Kuhn-Tucker conditions yield

\[ \frac{\partial J}{\partial D_i} = -\frac{1}{2} \frac{1}{D_i} + \lambda, \tag{10.49} \]

where $\lambda$ is chosen so that

\[ \frac{\partial J}{\partial D_i} \begin{cases} = 0 & \text{if } D_i < \sigma_i^2, \\ \le 0 & \text{if } D_i \ge \sigma_i^2. \end{cases} \tag{10.50} \]


It is easy to check that the solution to the Kuhn-Tucker equations is given
by the following theorem:

Theorem 10.3.3 (Rate distortion for a parallel Gaussian source) Let
$X_i \sim \mathcal{N}(0, \sigma_i^2)$, $i = 1, 2, \ldots, m$, be independent Gaussian random
variables, and let the distortion measure be $d(x^m, \hat{x}^m) = \sum_{i=1}^m (x_i - \hat{x}_i)^2$.
Then the rate distortion function is given by

\[ R(D) = \sum_{i=1}^m \frac{1}{2} \log \frac{\sigma_i^2}{D_i}, \tag{10.51} \]

where

\[ D_i = \begin{cases} \lambda & \text{if } \lambda < \sigma_i^2, \\ \sigma_i^2 & \text{if } \lambda \ge \sigma_i^2, \end{cases} \tag{10.52} \]

where $\lambda$ is chosen so that $\sum_{i=1}^m D_i = D$.

FIGURE 10.7. Reverse water-filling for independent Gaussian random variables.

This gives rise to a kind of reverse water-filling, as illustrated in
Figure 10.7. We choose a constant $\lambda$ and only describe those random
variables with variances greater than $\lambda$. No bits are used to describe random
variables with variance less than $\lambda$. Summarizing, if

\[ X \sim \mathcal{N}\left(0, \begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_m^2 \end{bmatrix}\right), \quad \text{then} \quad \hat{X} \sim \mathcal{N}\left(0, \begin{bmatrix} \hat{\sigma}_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \hat{\sigma}_m^2 \end{bmatrix}\right), \]

and $E(X_i - \hat{X}_i)^2 = D_i$, where $D_i = \min\{\lambda, \sigma_i^2\}$. More generally, the rate
distortion function for a multivariate normal vector can be obtained by
reverse water-filling on the eigenvalues. We can also apply the same
arguments to a Gaussian stochastic process. By the spectral representation
theorem, a Gaussian stochastic process can be represented as an
integral of independent Gaussian processes in the various frequency bands.
Reverse water-filling on the spectrum yields the rate distortion function.
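The reverse water-filling rule of Theorem 10.3.3 can be sketched in a few lines of Python. The text does not prescribe how to find $\lambda$; the bisection search below is one simple choice, made here as an assumption, and the function name and return convention are illustrative.

```python
import math


def reverse_water_filling(variances, D, tol=1e-12):
    """Return (lam, distortions, rate_bits) for independent N(0, var_i) components.

    lam is the water level with sum_i min(lam, var_i) = D, as in (10.52);
    rate_bits is the sum in (10.51) evaluated in bits.
    Assumes every variance is positive and 0 < D <= sum(variances).
    """
    if not 0.0 < D <= sum(variances):
        raise ValueError("need 0 < D <= total variance")
    lo, hi = 0.0, max(variances)
    # sum_i min(lam, var_i) is nondecreasing in lam, so bisection applies.
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if sum(min(lam, v) for v in variances) < D:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    distortions = [min(lam, v) for v in variances]
    # Components with var_i <= lam get D_i = var_i and contribute zero rate.
    rate_bits = sum(0.5 * math.log2(v / d) for v, d in zip(variances, distortions))
    return lam, distortions, rate_bits
```

For variances (4, 1, 0.25) and total distortion $D = 1.5$, the water level is $\lambda = 0.625$: the first two components are each described to distortion 0.625, and the third, whose variance lies below the water level, receives no bits.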

10.4 CONVERSE TO THE RATE DISTORTION THEOREM 

In this section we prove the converse to Theorem 10.2.1 by showing that
we cannot achieve a distortion of less than $D$ if we describe $X$ at a rate
less than $R(D)$, where

\[ R(D) = \min_{p(\hat{x}|x):\, \sum_{(x,\hat{x})} p(x)p(\hat{x}|x)d(x,\hat{x}) \le D} I(X; \hat{X}). \tag{10.53} \]

The minimization is over all conditional distributions $p(\hat{x}|x)$ for which
the joint distribution $p(x, \hat{x}) = p(x)p(\hat{x}|x)$ satisfies the expected
distortion constraint. Before proving the converse, we establish some simple
properties of the information rate distortion function.

Lemma 10.4.1 (Convexity of R(D)) The rate distortion function $R(D)$
given in (10.53) is a nonincreasing convex function of $D$.

Proof: $R(D)$ is the minimum of the mutual information over
increasingly larger sets as $D$ increases. Thus, $R(D)$ is nonincreasing in $D$. To
prove that $R(D)$ is convex, consider two rate distortion pairs, $(R_1, D_1)$
and $(R_2, D_2)$, which lie on the rate distortion curve. Let the joint
distributions that achieve these pairs be $p_1(x, \hat{x}) = p(x)p_1(\hat{x}|x)$ and
$p_2(x, \hat{x}) = p(x)p_2(\hat{x}|x)$. Consider the distribution $p_\lambda = \lambda p_1 + (1 - \lambda)p_2$.
Since the distortion is a linear function of the distribution, we have
$D(p_\lambda) = \lambda D_1 + (1 - \lambda)D_2$. Mutual information, on the other hand, is a convex
function of the conditional distribution (Theorem 2.7.4), and hence

\[ I_{p_\lambda}(X; \hat{X}) \le \lambda I_{p_1}(X; \hat{X}) + (1 - \lambda) I_{p_2}(X; \hat{X}). \tag{10.54} \]

Hence, by the definition of the rate distortion function,

\begin{align}
R(D_\lambda) &\le I_{p_\lambda}(X; \hat{X}) \tag{10.55} \\
&\le \lambda I_{p_1}(X; \hat{X}) + (1 - \lambda) I_{p_2}(X; \hat{X}) \tag{10.56} \\
&= \lambda R(D_1) + (1 - \lambda) R(D_2), \tag{10.57}
\end{align}

which proves that $R(D)$ is a convex function of $D$. □


The converse can now be proved. 

Proof: (Converse in Theorem 10.2.1). We must show for any source $X$
drawn i.i.d. $\sim p(x)$ with distortion measure $d(x, \hat{x})$ and any $(2^{nR}, n)$ rate
distortion code with distortion $\le D$, that the rate $R$ of the code
satisfies $R \ge R(D)$. In fact, we prove that $R \ge R(D)$ even for randomized
mappings $f_n$ and $g_n$, as long as $f_n$ takes on at most $2^{nR}$ values.

Consider any $(2^{nR}, n)$ rate distortion code defined by functions $f_n$ and
$g_n$ as given in (10.7) and (10.8). Let $\hat{X}^n = \hat{X}^n(X^n) = g_n(f_n(X^n))$ be the
reproduced sequence corresponding to $X^n$. Assume that $E d(X^n, \hat{X}^n) \le D$
for this code. Then we have the following chain of inequalities:

\begin{align}
nR &\stackrel{(a)}{\ge} H(f_n(X^n)) \tag{10.58} \\
&\stackrel{(b)}{\ge} H(f_n(X^n)) - H(f_n(X^n) \mid X^n) \tag{10.59} \\
&= I(X^n; f_n(X^n)) \tag{10.60} \\
&\stackrel{(c)}{\ge} I(X^n; \hat{X}^n) \tag{10.61} \\
&= H(X^n) - H(X^n \mid \hat{X}^n) \tag{10.62} \\
&\stackrel{(d)}{=} \sum_{i=1}^n H(X_i) - H(X^n \mid \hat{X}^n) \tag{10.63} \\
&\stackrel{(e)}{=} \sum_{i=1}^n H(X_i) - \sum_{i=1}^n H(X_i \mid \hat{X}^n, X_{i-1}, \ldots, X_1) \tag{10.64} \\
&\stackrel{(f)}{\ge} \sum_{i=1}^n H(X_i) - \sum_{i=1}^n H(X_i \mid \hat{X}_i) \tag{10.65} \\
&= \sum_{i=1}^n I(X_i; \hat{X}_i) \tag{10.66} \\
&\stackrel{(g)}{\ge} \sum_{i=1}^n R(E d(X_i, \hat{X}_i)) \tag{10.67} \\
&= n \left( \frac{1}{n} \sum_{i=1}^n R(E d(X_i, \hat{X}_i)) \right) \tag{10.68} \\
&\stackrel{(h)}{\ge} n R\left( \frac{1}{n} \sum_{i=1}^n E d(X_i, \hat{X}_i) \right) \tag{10.69} \\
&\stackrel{(i)}{=} n R(E d(X^n, \hat{X}^n)) \tag{10.70} \\
&\stackrel{(j)}{\ge} n R(D), \tag{10.71}
\end{align}

where

(a) follows from the fact that the range of $f_n$ is at most $2^{nR}$

(b) follows from the fact that $H(f_n(X^n) \mid X^n) \ge 0$

(c) follows from the data-processing inequality

(d) follows from the fact that the $X_i$ are independent

(e) follows from the chain rule for entropy

(f) follows from the fact that conditioning reduces entropy

(g) follows from the definition of the rate distortion function

(h) follows from the convexity of the rate distortion function (Lemma
10.4.1) and Jensen's inequality

(i) follows from the definition of distortion for blocks of length $n$

(j) follows from the fact that $R(D)$ is a nonincreasing function of $D$ and
$E d(X^n, \hat{X}^n) \le D$

This shows that the rate $R$ of any rate distortion code exceeds the rate
distortion function $R(D)$ evaluated at the distortion level $D = E d(X^n, \hat{X}^n)$
achieved by that code. □

A similar argument can be applied when the encoded source is passed 
through a noisy channel and hence we have the equivalent of the source 
channel separation theorem with distortion: 

Theorem 10.4.1 (Source-channel separation theorem with distortion)
Let $V_1, V_2, \ldots, V_n$ be a finite alphabet i.i.d. source which is encoded as
a sequence of $n$ input symbols $X^n$ of a discrete memoryless channel with
capacity $C$. The output of the channel $Y^n$ is mapped onto the reconstruction
alphabet $\hat{V}^n = g(Y^n)$. Let $D = E d(V^n, \hat{V}^n) = \frac{1}{n} \sum_{i=1}^n E d(V_i, \hat{V}_i)$ be the
average distortion achieved by this combined source and channel coding
scheme. Then distortion $D$ is achievable if and only if $C > R(D)$.

$V^n \to X^n(V^n) \to$ Channel (capacity $C$) $\to Y^n \to \hat{V}^n$


Proof: See Problem 10.17. □ 


10.5 ACHIEVABILITY OF THE RATE DISTORTION FUNCTION 

We now prove the achievability of the rate distortion function. We begin 
with a modified version of the joint AEP in which we add the condition 
that the pair of sequences be typical with respect to the distortion measure. 
Definition Let $p(x, \hat{x})$ be a joint probability distribution on $\mathcal{X} \times \hat{\mathcal{X}}$ and
let $d(x, \hat{x})$ be a distortion measure on $\mathcal{X} \times \hat{\mathcal{X}}$. For any $\epsilon > 0$, a pair of
sequences $(x^n, \hat{x}^n)$ is said to be distortion $\epsilon$-typical or simply distortion
typical if

\begin{align}
\left| -\frac{1}{n} \log p(x^n) - H(X) \right| &< \epsilon \tag{10.72} \\
\left| -\frac{1}{n} \log p(\hat{x}^n) - H(\hat{X}) \right| &< \epsilon \tag{10.73} \\
\left| -\frac{1}{n} \log p(x^n, \hat{x}^n) - H(X, \hat{X}) \right| &< \epsilon \tag{10.74} \\
\left| d(x^n, \hat{x}^n) - E d(X, \hat{X}) \right| &< \epsilon. \tag{10.75}
\end{align}

The set of distortion typical sequences is called the distortion typical set
and is denoted $A_{d,\epsilon}^{(n)}$.

Note that this is the definition of the jointly typical set (Section 7.6)
with the additional constraint that the distortion be close to the expected
value. Hence, the distortion typical set is a subset of the jointly typical
set (i.e., $A_{d,\epsilon}^{(n)} \subset A_\epsilon^{(n)}$). If $(X_i, \hat{X}_i)$ are drawn i.i.d. $\sim p(x, \hat{x})$, the distortion
between two random sequences

\[ d(X^n, \hat{X}^n) = \frac{1}{n} \sum_{i=1}^n d(X_i, \hat{X}_i) \tag{10.76} \]

is an average of i.i.d. random variables, and the law of large numbers
implies that it is close to its expected value with high probability. Hence
we have the following lemma.

Lemma 10.5.1 Let $(X_i, \hat{X}_i)$ be drawn i.i.d. $\sim p(x, \hat{x})$. Then $\Pr(A_{d,\epsilon}^{(n)}) \to 1$
as $n \to \infty$.

Proof: The sums in the four conditions in the definition of $A_{d,\epsilon}^{(n)}$ are
all normalized sums of i.i.d. random variables and hence, by the law of
large numbers, tend to their respective expected values with probability 1.
Hence the set of sequences satisfying all four conditions has probability
tending to 1 as $n \to \infty$. □


The following lemma is a direct consequence of the definition of the 
distortion typical set. 
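Lemma 10.5.2 For all $(x^n, \hat{x}^n) \in A_{d,\epsilon}^{(n)}$,

\[ p(\hat{x}^n) \ge p(\hat{x}^n \mid x^n) 2^{-n(I(X;\hat{X}) + 3\epsilon)}. \tag{10.77} \]

Proof: Using the definition of $A_{d,\epsilon}^{(n)}$, we can bound the probabilities
$p(x^n)$, $p(\hat{x}^n)$ and $p(x^n, \hat{x}^n)$ for all $(x^n, \hat{x}^n) \in A_{d,\epsilon}^{(n)}$, and hence

\begin{align}
p(\hat{x}^n \mid x^n) &= \frac{p(x^n, \hat{x}^n)}{p(x^n)} \tag{10.78} \\
&= p(\hat{x}^n) \frac{p(x^n, \hat{x}^n)}{p(x^n) p(\hat{x}^n)} \tag{10.79} \\
&\le p(\hat{x}^n) \frac{2^{-n(H(X,\hat{X}) - \epsilon)}}{2^{-n(H(X) + \epsilon)} 2^{-n(H(\hat{X}) + \epsilon)}} \tag{10.80} \\
&= p(\hat{x}^n) 2^{n(I(X;\hat{X}) + 3\epsilon)}, \tag{10.81}
\end{align}

and the lemma follows immediately. □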

We also need the following interesting inequality. 

Lemma 10.5.3 For $0 \le x, y \le 1$, $n > 0$,

\[ (1 - xy)^n \le 1 - x + e^{-yn}. \tag{10.82} \]

Proof: Let $f(y) = e^{-y} - 1 + y$. Then $f(0) = 0$ and $f'(y) = -e^{-y} + 1 > 0$
for $y > 0$, and hence $f(y) > 0$ for $y > 0$. Hence for $0 \le y \le 1$,
we have $1 - y \le e^{-y}$, and raising this to the $n$th power, we obtain

\[ (1 - y)^n \le e^{-yn}. \tag{10.83} \]

Thus, the lemma is satisfied for $x = 1$. By examination, it is clear that
the inequality is also satisfied for $x = 0$. By differentiation, it is easy
to see that $g_y(x) = (1 - xy)^n$ is a convex function of $x$, and hence for
$0 \le x \le 1$, we have

\begin{align}
(1 - xy)^n &= g_y(x) \tag{10.84} \\
&\le (1 - x) g_y(0) + x g_y(1) \tag{10.85} \\
&= (1 - x) \cdot 1 + x(1 - y)^n \tag{10.86} \\
&\le 1 - x + x e^{-yn} \tag{10.87} \\
&\le 1 - x + e^{-yn}. \quad \Box \tag{10.88}
\end{align}
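A numerical spot check of (10.82) can be reassuring before the inequality is put to use. The sketch below evaluates the gap between the two sides on a grid of $(x, y)$ values for integer $n$; it is only a sanity check under those assumptions, not a substitute for the proof, and the function names are illustrative.

```python
import math


def lemma_gap(x: float, y: float, n: int) -> float:
    """Right-hand side minus left-hand side of (10.82); nonnegative if the bound holds."""
    return (1.0 - x + math.exp(-y * n)) - (1.0 - x * y) ** n


def min_gap_on_grid(n: int, steps: int = 50) -> float:
    """Smallest gap over an evenly spaced (steps+1) x (steps+1) grid on [0, 1]^2."""
    return min(
        lemma_gap(i / steps, j / steps, n)
        for i in range(steps + 1)
        for j in range(steps + 1)
    )
```

The gap is exactly zero at $x = 1$, $y = 0$, where both sides equal 1, so the bound is tight at that corner.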


We use the preceding proof to prove the achievability of Theorem 10.2.1. 
Proof: (Achievability in Theorem 10.2.1). Let $X_1, X_2, \ldots, X_n$ be drawn
i.i.d. $\sim p(x)$ and let $d(x, \hat{x})$ be a bounded distortion measure for this
source. Let the rate distortion function for this source be $R(D)$. Then for
any $D$, and any $R > R(D)$, we will show that the rate distortion pair
$(R, D)$ is achievable by proving the existence of a sequence of rate
distortion codes with rate $R$ and asymptotic distortion $D$. Fix $p(\hat{x}|x)$, where
$p(\hat{x}|x)$ achieves equality in (10.53). Thus, $I(X; \hat{X}) = R(D)$. Calculate
$p(\hat{x}) = \sum_x p(x)p(\hat{x}|x)$. Choose $\delta > 0$. We will prove the existence of a
rate distortion code with rate $R$ and distortion less than or equal to $D + \delta$.

Generation of codebook: Randomly generate a rate distortion codebook
$\mathcal{C}$ consisting of $2^{nR}$ sequences $\hat{X}^n$ drawn i.i.d. $\sim \prod_{i=1}^n p(\hat{x}_i)$. Index these
codewords by $w \in \{1, 2, \ldots, 2^{nR}\}$. Reveal this codebook to the encoder
and decoder.

Encoding: Encode $X^n$ by $w$ if there exists a $w$ such that
$(X^n, \hat{X}^n(w)) \in A_{d,\epsilon}^{(n)}$, the distortion typical set. If there is more than one
such $w$, send the least. If there is no such $w$, let $w = 1$. Thus, $nR$ bits
suffice to describe the index $w$ of the jointly typical codeword.

Decoding: The reproduced sequence is $\hat{X}^n(w)$.

Calculation of distortion: As in the case of the channel coding theorem,
we calculate the expected distortion over the random choice of codebooks
$\mathcal{C}$ as

\[ \bar{D} = E_{X^n, \mathcal{C}}\, d(X^n, \hat{X}^n), \tag{10.89} \]

where the expectation is over the random choice of codebooks and over
$X^n$.

For a fixed codebook $\mathcal{C}$ and choice of $\epsilon > 0$, we divide the sequences
$x^n \in \mathcal{X}^n$ into two categories:

• Sequences $x^n$ such that there exists a codeword $\hat{X}^n(w)$ that is
distortion typical with $x^n$ [i.e., $d(x^n, \hat{x}^n(w)) < D + \epsilon$]. Since the total
probability of these sequences is at most 1, these sequences contribute
at most $D + \epsilon$ to the expected distortion.

• Sequences $x^n$ such that there does not exist a codeword $\hat{X}^n(w)$
that is distortion typical with $x^n$. Let $P_e$ be the total probability of
these sequences. Since the distortion for any individual sequence is
bounded by $d_{\max}$, these sequences contribute at most $P_e d_{\max}$ to the
expected distortion.

Hence, we can bound the total distortion by

\[ E d(X^n, \hat{X}^n(X^n)) \le D + \epsilon + P_e d_{\max}, \tag{10.90} \]

which can be made less than $D + \delta$ for an appropriate choice of $\epsilon$ if $P_e$ is
small enough. Hence, if we show that $P_e$ is small, the expected distortion
is close to $D$ and the theorem is proved.

Calculation of $P_e$: We must bound the probability that for a random
choice of codebook $\mathcal{C}$ and a randomly chosen source sequence, there is
no codeword that is distortion typical with the source sequence. Let $J(\mathcal{C})$
denote the set of source sequences $x^n$ such that at least one codeword in
$\mathcal{C}$ is distortion typical with $x^n$. Then

\[ P_e = \sum_{\mathcal{C}} P(\mathcal{C}) \sum_{x^n:\, x^n \notin J(\mathcal{C})} p(x^n). \tag{10.91} \]

This is the probability of all sequences not well represented by a code,
averaged over the randomly chosen code. By changing the order of
summation, we can also interpret this as the probability of choosing a
codebook that does not well represent sequence $x^n$, averaged with respect to
$p(x^n)$. Thus,

\[ P_e = \sum_{x^n} p(x^n) \sum_{\mathcal{C}:\, x^n \notin J(\mathcal{C})} p(\mathcal{C}). \tag{10.92} \]

Let us define

\[ K(x^n, \hat{x}^n) = \begin{cases} 1 & \text{if } (x^n, \hat{x}^n) \in A_{d,\epsilon}^{(n)}, \\ 0 & \text{if } (x^n, \hat{x}^n) \notin A_{d,\epsilon}^{(n)}. \end{cases} \tag{10.93} \]

The probability that a single randomly chosen codeword $\hat{X}^n$ does not well
represent a fixed $x^n$ is

\[ \Pr((x^n, \hat{X}^n) \notin A_{d,\epsilon}^{(n)}) = \Pr(K(x^n, \hat{X}^n) = 0) = 1 - \sum_{\hat{x}^n} p(\hat{x}^n) K(x^n, \hat{x}^n), \tag{10.94} \]

and therefore the probability that $2^{nR}$ independently chosen codewords do
not represent $x^n$, averaged over $p(x^n)$, is

\begin{align}
P_e &= \sum_{x^n} p(x^n) \sum_{\mathcal{C}:\, x^n \notin J(\mathcal{C})} p(\mathcal{C}) \tag{10.95} \\
&= \sum_{x^n} p(x^n) \left[ 1 - \sum_{\hat{x}^n} p(\hat{x}^n) K(x^n, \hat{x}^n) \right]^{2^{nR}}. \tag{10.96}
\end{align}

We now use Lemma 10.5.2 to bound the sum within the brackets. From
Lemma 10.5.2, it follows that

\[ \sum_{\hat{x}^n} p(\hat{x}^n) K(x^n, \hat{x}^n) \ge \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) 2^{-n(I(X;\hat{X}) + 3\epsilon)} K(x^n, \hat{x}^n), \tag{10.97} \]

and hence

\[ P_e \le \sum_{x^n} p(x^n) \left( 1 - 2^{-n(I(X;\hat{X}) + 3\epsilon)} \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) \right)^{2^{nR}}. \tag{10.98} \]

We now use Lemma 10.5.3 to bound the term on the right-hand side
of (10.98) and obtain

\[ \left( 1 - 2^{-n(I(X;\hat{X}) + 3\epsilon)} \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) \right)^{2^{nR}} \le 1 - \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) + e^{-\left( 2^{-n(I(X;\hat{X}) + 3\epsilon)} 2^{nR} \right)}. \tag{10.99} \]

Substituting this inequality in (10.98), we obtain

\[ P_e \le 1 - \sum_{x^n} \sum_{\hat{x}^n} p(x^n) p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) + e^{-2^{-n(I(X;\hat{X}) + 3\epsilon)} 2^{nR}}. \tag{10.100} \]

The last term in the bound is equal to

\[ e^{-2^{n(R - I(X;\hat{X}) - 3\epsilon)}}, \tag{10.101} \]

which goes to zero exponentially fast with $n$ if $R > I(X; \hat{X}) + 3\epsilon$. Hence
if we choose $p(\hat{x}|x)$ to be the conditional distribution that achieves the
minimum in the rate distortion function, then $R > R(D)$ implies that
$R > I(X; \hat{X})$ and we can choose $\epsilon$ small enough so that the last term in
(10.100) goes to 0.

The first two terms in (10.100) give the probability under the joint
distribution $p(x^n, \hat{x}^n)$ that the pair of sequences is not distortion typical.
Hence, using Lemma 10.5.1, we obtain

\[ 1 - \sum_{x^n} \sum_{\hat{x}^n} p(x^n, \hat{x}^n) K(x^n, \hat{x}^n) = \Pr((X^n, \hat{X}^n) \notin A_{d,\epsilon}^{(n)}) < \epsilon \tag{10.102} \]

for $n$ sufficiently large. Therefore, by an appropriate choice of $\epsilon$ and $n$,
we can make $P_e$ as small as we like.

So, for any choice of $\delta > 0$, there exists an $\epsilon$ and $n$ such that over all
randomly chosen rate $R$ codes of block length $n$, the expected distortion
is less than $D + \delta$. Hence, there must exist at least one code $\mathcal{C}^*$ with this
rate and block length with average distortion less than $D + \delta$. Since $\delta$ was
arbitrary, we have shown that $(R, D)$ is achievable if $R > R(D)$. □

We have proved the existence of a rate distortion code with an expected 
distortion close to D and a rate close to R(D). The similarities between 
the random coding proof of the rate distortion theorem and the random 
coding proof of the channel coding theorem are now evident. We will 
explore the parallels further by considering the Gaussian example, which 
provides some geometric insight into the problem. It turns out that channel 
coding is sphere packing and rate distortion coding is sphere covering. 


Channel coding for the Gaussian channel. Consider a Gaussian channel,
$Y_i = X_i + Z_i$, where the $Z_i$ are i.i.d. $\sim \mathcal{N}(0, N)$ and there is a power
constraint $P$ on the power per symbol of the transmitted codeword.
Consider a sequence of $n$ transmissions. The power constraint implies that
the transmitted sequence lies within a sphere of radius $\sqrt{nP}$ in $\mathbb{R}^n$. The
coding problem is equivalent to finding a set of $2^{nR}$ sequences within
this sphere such that the probability of any of them being mistaken for
any other is small; that is, the spheres of radius $\sqrt{nN}$ around each of them are
almost disjoint. This corresponds to filling a sphere of radius $\sqrt{n(P + N)}$
with spheres of radius $\sqrt{nN}$. One would expect that the largest number
of spheres that could be fit would be the ratio of their volumes, or,
equivalently, the $n$th power of the ratio of their radii. Thus, if $M$ is the number
of codewords that can be transmitted efficiently, we have

\[ M \le \frac{\left( \sqrt{n(P + N)} \right)^n}{\left( \sqrt{nN} \right)^n} = \left( \frac{P + N}{N} \right)^{n/2}. \tag{10.103} \]

The results of the channel coding theorem show that it is possible to do
this efficiently for large $n$; it is possible to find approximately

\[ 2^{nC} = \left( \frac{P + N}{N} \right)^{n/2} \tag{10.104} \]

codewords such that the noise spheres around them are almost disjoint
(the total volume of their intersection is arbitrarily small).
Rate distortion for the Gaussian source. Consider a Gaussian source of
variance $\sigma^2$. A $(2^{nR}, n)$ rate distortion code for this source with distortion
$D$ is a set of $2^{nR}$ sequences in $\mathbb{R}^n$ such that most source sequences of
length $n$ (all those that lie within a sphere of radius $\sqrt{n\sigma^2}$) are within a
distance $\sqrt{nD}$ of some codeword. Again, by the sphere-packing argument,
it is clear that the minimum number of codewords required is

\[ 2^{nR(D)} = \left( \frac{\sigma^2}{D} \right)^{n/2}. \tag{10.105} \]

The rate distortion theorem shows that this minimum rate is asymptotically
achievable (i.e., that there exists a collection of spheres of radius $\sqrt{nD}$
that cover the space except for a set of arbitrarily small probability).

The above geometric arguments also enable us to transform a good 
code for channel transmission into a good code for rate distortion. In both 
cases, the essential idea is to fill the space of source sequences: In channel 
transmission, we want to find the largest set of codewords that have a large 
minimum distance between codewords, whereas in rate distortion, we wish 
to find the smallest set of codewords that covers the entire space. If we 
have any set that meets the sphere packing bound for one, it will meet the 
sphere packing bound for the other. In the Gaussian case, choosing the 
codewords to be Gaussian with the appropriate variance is asymptotically 
optimal for both rate distortion and channel coding. 


10.6 STRONGLY TYPICAL SEQUENCES AND RATE 
DISTORTION 

In Section 10.5 we proved the existence of a rate distortion code of
rate $R(D)$ with average distortion close to $D$. In fact, not only is the
average distortion close to $D$, but the total probability that the
distortion is greater than $D + \delta$ is close to 0. The proof of this is similar
to the proof in Section 10.5; the main difference is that we will use
strongly typical sequences rather than weakly typical sequences. This
will enable us to give an upper bound to the probability that a typical
source sequence is not well represented by a randomly chosen codeword
in (10.94). We now outline an alternative proof based on strong
typicality that will provide a stronger and more intuitive approach to the rate
distortion theorem.

We begin by defining strong typicality and quoting a basic theorem
bounding the probability that two sequences are jointly typical. The
properties of strong typicality were introduced by Berger [53] and were
explored in detail in the book by Csiszár and Körner [149]. We will
define strong typicality (as in Chapter 11) and state a fundamental lemma
(Lemma 10.6.2).

Definition A sequence $x^n \in \mathcal{X}^n$ is said to be $\epsilon$-strongly typical with
respect to a distribution $p(x)$ on $\mathcal{X}$ if:

1. For all $a \in \mathcal{X}$ with $p(a) > 0$, we have

\[ \left| \frac{1}{n} N(a \mid x^n) - p(a) \right| < \frac{\epsilon}{|\mathcal{X}|}. \tag{10.106} \]

2. For all $a \in \mathcal{X}$ with $p(a) = 0$, $N(a \mid x^n) = 0$.

$N(a \mid x^n)$ is the number of occurrences of the symbol $a$ in the sequence
$x^n$.

The set of sequences $x^n \in \mathcal{X}^n$ such that $x^n$ is strongly typical is called
the strongly typical set and is denoted $A_\epsilon^{*(n)}(X)$ or $A_\epsilon^{*(n)}$ when the random
variable is understood from the context.
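The two conditions of the definition translate directly into a membership test. The following Python sketch checks $\epsilon$-strong typicality of a single sequence; the function name and the dictionary representation of $p$ are assumptions made for illustration, and a symbol that does not appear in the dictionary at all is treated here as having probability 0.

```python
from collections import Counter


def is_strongly_typical(seq, p, eps):
    """Check whether seq is eps-strongly typical with respect to p.

    p maps every symbol of the alphabet to its probability (zeros included),
    so len(p) plays the role of the alphabet size |X| in (10.106).
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be nonempty")
    counts = Counter(seq)
    # Symbols outside the alphabet are treated like zero-probability symbols.
    if any(symbol not in p for symbol in counts):
        return False
    bound = eps / len(p)
    for a, pa in p.items():
        if pa == 0:
            # Condition 2: zero-probability symbols must not occur.
            if counts[a] != 0:
                return False
        elif abs(counts[a] / n - pa) >= bound:
            # Condition 1: empirical frequency within eps/|X| of p(a).
            return False
    return True
```

With `p = {"a": 0.5, "b": 0.5, "c": 0.0}` and `eps = 0.3`, the sequence `"abab"` passes, while `"aaab"` fails condition 1 and `"abac"` fails condition 2.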

Definition A pair of sequences $(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$ is said to be
$\epsilon$-strongly typical with respect to a distribution $p(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ if:

1. For all $(a, b) \in \mathcal{X} \times \mathcal{Y}$ with $p(a, b) > 0$, we have

\[ \left| \frac{1}{n} N(a, b \mid x^n, y^n) - p(a, b) \right| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|}. \tag{10.107} \]

2. For all $(a, b) \in \mathcal{X} \times \mathcal{Y}$ with $p(a, b) = 0$, $N(a, b \mid x^n, y^n) = 0$.

$N(a, b \mid x^n, y^n)$ is the number of occurrences of the pair $(a, b)$ in the pair
of sequences $(x^n, y^n)$.

The set of sequences $(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$ such that $(x^n, y^n)$ is strongly
typical is called the strongly typical set and is denoted $A_\epsilon^{*(n)}(X, Y)$ or
$A_\epsilon^{*(n)}$. From the definition, it follows that if $(x^n, y^n) \in A_\epsilon^{*(n)}(X, Y)$, then
$x^n \in A_\epsilon^{*(n)}(X)$. From the strong law of large numbers, the following
lemma is immediate.

Lemma 10.6.1 Let $(X_i, Y_i)$ be drawn i.i.d. $\sim p(x, y)$. Then $\Pr(A_\epsilon^{*(n)}) \to 1$
as $n \to \infty$.


We will use one basic result, which bounds the probability that an
independently drawn sequence will be seen as jointly strongly typical
with a given sequence. Theorem 7.6.1 shows that if we choose $X^n$ and
$Y^n$ independently, the probability that they will be weakly jointly typical
is $\approx 2^{-nI(X;Y)}$. The following lemma extends the result to strongly typical
sequences. This is stronger than the earlier result in that it gives a lower
bound on the probability that a randomly chosen sequence is jointly typical
with a fixed typical $x^n$.

Lemma 10.6.2 Let $Y_1, Y_2, \ldots, Y_n$ be drawn i.i.d. $\sim p(y)$. For
$x^n \in A_\epsilon^{*(n)}(X)$, the probability that $(x^n, Y^n) \in A_\epsilon^{*(n)}$ is bounded by

\[ 2^{-n(I(X;Y) + \epsilon_1)} \le \Pr((x^n, Y^n) \in A_\epsilon^{*(n)}) \le 2^{-n(I(X;Y) - \epsilon_1)}, \tag{10.108} \]

where $\epsilon_1$ goes to 0 as $\epsilon \to 0$ and $n \to \infty$.

Proof: We will not prove this lemma, but instead, outline the proof in 
Problem 10.16 at the end of the chapter. In essence, the proof involves 
finding a lower bound on the size of the conditionally typical set. □ 

We will proceed directly to the achievability of the rate distortion 
function. We will only give an outline to illustrate the main ideas. The 
construction of the codebook and the encoding and decoding are similar 
to the proof in Section 10.5. 

Proof: Fix $p(\hat{x}|x)$. Calculate $p(\hat{x}) = \sum_x p(x)p(\hat{x}|x)$. Fix $\epsilon > 0$. Later
we will choose $\epsilon$ appropriately to achieve an expected distortion less than
$D + \delta$.

Generation of codebook: Generate a rate distortion codebook $\mathcal{C}$
consisting of $2^{nR}$ sequences $\hat{X}^n$ drawn i.i.d. $\sim \prod_i p(\hat{x}_i)$. Denote the sequences
$\hat{X}^n(1), \ldots, \hat{X}^n(2^{nR})$.

Encoding: Given a sequence $X^n$, index it by $w$ if there exists a $w$ such
that $(X^n, \hat{X}^n(w)) \in A_\epsilon^{*(n)}$, the strongly jointly typical set. If there is more
than one such $w$, send the first in lexicographic order. If there is no such
$w$, let $w = 1$.

Decoding: Let the reproduced sequence be $\hat{X}^n(w)$.

Calculation of distortion: As in the case of the proof in Section 10.5, we
calculate the expected distortion over the random choice of codebook as

\begin{align}
\bar{D} &= E_{X^n, \mathcal{C}}\, d(X^n, \hat{X}^n) \tag{10.109} \\
&= E_{\mathcal{C}} \sum_{x^n} p(x^n) d(x^n, \hat{X}^n(x^n)) \tag{10.110} \\
&= \sum_{x^n} p(x^n) E_{\mathcal{C}}\, d(x^n, \hat{X}^n), \tag{10.111}
\end{align}

where the expectation is over the random choice of codebook. For a fixed
codebook $\mathcal{C}$, we divide the sequences $x^n \in \mathcal{X}^n$ into three categories, as
shown in Figure 10.8.

• Nontypical sequences $x^n \notin A_\epsilon^{*(n)}$. The total probability of these
sequences can be made less than $\epsilon$ by choosing $n$ large enough. Since
the individual distortion between any two sequences is bounded by
$d_{\max}$, the nontypical sequences can contribute at most $\epsilon d_{\max}$ to the
expected distortion.

• Typical sequences $x^n \in A_\epsilon^{*(n)}$ such that there exists a codeword $\hat{X}^n(w)$
that is jointly typical with $x^n$. In this case, since the source sequence
and the codeword are strongly jointly typical, the continuity of the
distortion as a function of the joint distribution ensures that they
are also distortion typical. Hence, the distortion between these $x^n$
and their codewords is bounded by $D + \epsilon d_{\max}$, and since the total
probability of these sequences is at most 1, these sequences contribute
at most $D + \epsilon d_{\max}$ to the expected distortion.

• Typical sequences $x^n \in A_\epsilon^{*(n)}$ such that there does not exist a
codeword $\hat{X}^n$ that is jointly typical with $x^n$. Let $P_e$ be the total probability
of these sequences. Since the distortion for any individual sequence
is bounded by $d_{\max}$, these sequences contribute at most $P_e d_{\max}$ to the
expected distortion.
The sequences in the first and third categories are the sequences that
may not be well represented by this rate distortion code. The probability
of the first category of sequences is less than $\epsilon$ for sufficiently large $n$. The
probability of the last category is $P_e$, which we will show can be made
small. This will prove the theorem that the total probability of sequences
that are not well represented is small. In turn, we use this to show that
the average distortion is close to $D$.

Calculation of $P_e$: We must bound the probability that there is no
codeword that is jointly typical with the given sequence $X^n$. From the joint
AEP, we know that the probability that $X^n$ and any $\hat{X}^n$ are jointly typical
is $\doteq 2^{-nI(X;\hat{X})}$. Hence the expected number of jointly typical $\hat{X}^n(w)$ is
$2^{nR} 2^{-nI(X;\hat{X})}$, which is exponentially large if $R > I(X; \hat{X})$.

But this is not sufficient to show that $P_e \to 0$. We must show that the
probability that there is no codeword that is jointly typical with $X^n$ goes
to zero. The fact that the expected number of jointly typical codewords is
exponentially large does not ensure that there will be at least one with high
probability. Just as in (10.94), we can expand the probability of error as

\[ P_e = \sum_{x^n \in A_\epsilon^{*(n)}} p(x^n) \left[ 1 - \Pr((x^n, \hat{X}^n) \in A_\epsilon^{*(n)}) \right]^{2^{nR}}. \tag{10.112} \]

From Lemma 10.6.2 we have

\[ \Pr((x^n, \hat{X}^n) \in A_\epsilon^{*(n)}) \ge 2^{-n(I(X;\hat{X}) + \epsilon_1)}. \tag{10.113} \]

Substituting this in (10.112) and using the inequality $(1 - x)^n \le e^{-nx}$, we
have

\[ P_e \le e^{-\left( 2^{nR} 2^{-n(I(X;\hat{X}) + \epsilon_1)} \right)}, \tag{10.114} \]

which goes to 0 as $n \to \infty$ if $R > I(X; \hat{X}) + \epsilon_1$. Hence for an appropriate
choice of $\epsilon$ and $n$, we can get the total probability of all badly represented
sequences to be as small as we want. Not only is the expected distortion
close to $D$, but with probability going to 1, we will find a codeword whose
distortion with respect to the given sequence is less than $D + \delta$. □

10.7 CHARACTERIZATION OF THE RATE DISTORTION 
FUNCTION 

We have defined the information rate distortion function as

    R(D) = \min_{q(\hat{x}|x) : \sum_{(x,\hat{x})} p(x) q(\hat{x}|x) d(x,\hat{x}) \le D} I(X;\hat{X}),    (10.115)

where the minimization is over all conditional distributions q(\hat{x}|x) for
which the joint distribution p(x)q(\hat{x}|x) satisfies the expected distortion
constraint. This is a standard minimization problem of a convex function
over the convex set of all q(\hat{x}|x) \ge 0 satisfying \sum_{\hat{x}} q(\hat{x}|x) = 1 for all x
and \sum_{x,\hat{x}} q(\hat{x}|x) p(x) d(x,\hat{x}) \le D.

We can use the method of Lagrange multipliers to find the solution. 
We set up the functional 

    J(q) = \sum_x \sum_{\hat{x}} p(x) q(\hat{x}|x) \log \frac{q(\hat{x}|x)}{\sum_x p(x) q(\hat{x}|x)}
           + \lambda \sum_x \sum_{\hat{x}} p(x) q(\hat{x}|x) d(x,\hat{x})    (10.116)
           + \sum_x \nu(x) \sum_{\hat{x}} q(\hat{x}|x),    (10.117)


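where the last term corresponds to the constraint that q(\hat{x}|x) is a conditional
probability mass function. If we let q(\hat{x}) = \sum_x p(x) q(\hat{x}|x) be the
distribution on \hat{X} induced by q(\hat{x}|x), we can rewrite J(q) as

    J(q) = \sum_x \sum_{\hat{x}} p(x) q(\hat{x}|x) \log \frac{q(\hat{x}|x)}{q(\hat{x})}
           + \lambda \sum_x \sum_{\hat{x}} p(x) q(\hat{x}|x) d(x,\hat{x})    (10.118)
           + \sum_x \nu(x) \sum_{\hat{x}} q(\hat{x}|x).    (10.119)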


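Differentiating with respect to q(\hat{x}|x), we have

    \frac{\partial J}{\partial q(\hat{x}|x)} = p(x) \log \frac{q(\hat{x}|x)}{q(\hat{x})} + p(x) - \sum_{x'} p(x') q(\hat{x}|x') \frac{1}{q(\hat{x})} p(x)
        + \lambda p(x) d(x,\hat{x}) + \nu(x) = 0.    (10.120)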


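Setting \log \mu(x) = \nu(x)/p(x), we obtain

    p(x) [ \log \frac{q(\hat{x}|x)}{q(\hat{x})} + \lambda d(x,\hat{x}) + \log \mu(x) ] = 0    (10.121)

or

    q(\hat{x}|x) = \frac{q(\hat{x}) e^{-\lambda d(x,\hat{x})}}{\mu(x)}.    (10.122)

Since \sum_{\hat{x}} q(\hat{x}|x) = 1, we must have

    \mu(x) = \sum_{\hat{x}} q(\hat{x}) e^{-\lambda d(x,\hat{x})}    (10.123)

or

    q(\hat{x}|x) = \frac{q(\hat{x}) e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}} q(\hat{x}) e^{-\lambda d(x,\hat{x})}}.    (10.124)
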

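Multiplying this by p(x) and summing over all x, we obtain

    q(\hat{x}) = q(\hat{x}) \sum_x \frac{p(x) e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}'} q(\hat{x}') e^{-\lambda d(x,\hat{x}')}}.    (10.125)

If q(\hat{x}) > 0, we can divide both sides by q(\hat{x}) and obtain

    \sum_x \frac{p(x) e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}'} q(\hat{x}') e^{-\lambda d(x,\hat{x}')}} = 1    (10.126)

for all \hat{x} \in \hat{\mathcal{X}}. We can combine these |\hat{\mathcal{X}}| equations with the equation
defining the distortion and calculate \lambda and the |\hat{\mathcal{X}}| unknowns q(\hat{x}). We
can use this and (10.124) to find the optimum conditional distribution.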

The above analysis is valid if q(\hat{x}) is unconstrained (i.e., q(\hat{x}) > 0 for
all \hat{x}). The inequality condition q(\hat{x}) \ge 0 is covered by the Kuhn-Tucker
conditions, which reduce to

    \frac{\partial J}{\partial q(\hat{x}|x)}  = 0   if q(\hat{x}|x) > 0,
                                              \ge 0 if q(\hat{x}|x) = 0.    (10.127)

Substituting the value of the derivative, we obtain the conditions for the
minimum as

    \sum_x \frac{p(x) e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}'} q(\hat{x}') e^{-\lambda d(x,\hat{x}')}} = 1     if q(\hat{x}) > 0,    (10.128)
                                                                                                      \le 1   if q(\hat{x}) = 0.    (10.129)

This characterization will enable us to check if a given q(\hat{x}) is a solution
to the minimization problem. However, it is not easy to solve for the 
optimum output distribution from these equations. In the next section we 
provide an iterative algorithm for computing the rate distortion function. 
This algorithm is a special case of a general algorithm for finding the 
minimum relative entropy distance between two convex sets of probability 
densities. 
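As a small illustration of how conditions (10.128) and (10.129) can be used to test a candidate output distribution, the sketch below evaluates the left-hand side for every reproduction symbol. The function names, the list-of-lists layout for d(x, \hat{x}), and the tolerance are assumptions made for this example only.

```python
import math

def kuhn_tucker_values(p, q_hat, d, lam):
    """For each xhat, compute
    sum_x p(x) e^{-lam d(x,xhat)} / sum_{xhat'} q(xhat') e^{-lam d(x,xhat')}.

    p      : source distribution p(x), a list
    q_hat  : candidate output distribution q(xhat), a list
    d      : distortion matrix, d[x][xhat]
    lam    : Lagrange multiplier (lambda >= 0)
    """
    values = []
    for j in range(len(q_hat)):
        total = 0.0
        for i, px in enumerate(p):
            denom = sum(q_hat[k] * math.exp(-lam * d[i][k])
                        for k in range(len(q_hat)))
            total += px * math.exp(-lam * d[i][j]) / denom
        values.append(total)
    return values

def satisfies_conditions(p, q_hat, d, lam, tol=1e-9):
    """True if the value is 1 wherever q(xhat) > 0 and at most 1 wherever q(xhat) = 0."""
    for qj, cj in zip(q_hat, kuhn_tucker_values(p, q_hat, d, lam)):
        if qj > 0 and abs(cj - 1.0) > tol:
            return False
        if qj == 0 and cj > 1.0 + tol:
            return False
    return True

if __name__ == "__main__":
    hamming = [[0, 1], [1, 0]]
    print(satisfies_conditions([0.5, 0.5], [0.5, 0.5], hamming, lam=2.0))
    print(satisfies_conditions([0.5, 0.5], [0.9, 0.1], hamming, lam=2.0))
```

For a Bernoulli(1/2) source with Hamming distortion the uniform output distribution passes the check, while a skewed candidate such as (0.9, 0.1) does not.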


10.8 COMPUTATION OF CHANNEL CAPACITY AND THE RATE 
DISTORTION FUNCTION 

Consider the following problem: Given two convex sets A and B in R^n
as shown in Figure 10.9, we would like to find the minimum distance
between them:

    d_{min} = \min_{a \in A, b \in B} d(a, b),    (10.130)

where d(a, b) is the Euclidean distance between a and b. An intuitively 
obvious algorithm to do this would be to take any point x e A, and find 
the y e B that is closest to it. Then fix this y and find the closest point in 
A. Repeating this process, it is clear that the distance decreases at each 
stage. Does it converge to the minimum distance between the two sets? 
Csiszar and Tusnady [155] have shown that if the sets are convex and 
if the distance satisfies certain conditions, this alternating minimization 
algorithm will indeed converge to the minimum. In particular, if the sets 
are sets of probability distributions and the distance measure is the relative 
entropy, the algorithm does converge to the minimum relative entropy 
between the two sets of distributions. 
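The alternating procedure is easy to try out in the Euclidean setting. The sketch below uses two disjoint disks in the plane as the convex sets; the choice of disks, the function names, and the starting point are assumptions for illustration, not part of the text.

```python
import math

def project_onto_disk(point, center, radius):
    """Closest point of a closed disk to `point` (Euclidean projection)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    dist = math.hypot(dx, dy)
    if dist <= radius:
        return point
    scale = radius / dist
    return (center[0] + dx * scale, center[1] + dy * scale)

def alternating_minimization(center_a, radius_a, center_b, radius_b, start, steps=50):
    """Alternate between the closest point in B to a and the closest point in A to b.

    Returns the final pair (a, b) and the distance after each full round.
    """
    a = project_onto_disk(start, center_a, radius_a)
    history = []
    for _ in range(steps):
        b = project_onto_disk(a, center_b, radius_b)   # fix a, minimize over B
        a = project_onto_disk(b, center_a, radius_a)   # fix b, minimize over A
        history.append(math.dist(a, b))
    return a, b, history

if __name__ == "__main__":
    a, b, history = alternating_minimization((0.0, 0.0), 1.0, (4.0, 0.0), 1.5,
                                             start=(0.0, 0.9))
    print(a, b, history[-1])
```

For these two disks the true minimum distance is 4 - 1 - 1.5 = 1.5, and the recorded distances never increase from one round to the next.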



FIGURE 10.9. Distance between convex sets. 






To apply this algorithm to rate distortion, we have to rewrite the rate 
distortion function as a minimum of the relative entropy between two sets. 
We begin with a simple lemma. A form of this lemma comes up again
in Theorem 13.1.1, establishing the duality of channel capacity and universal
data compression.

Lemma 10.8.1 Let p(x)p(y|x) be a given joint distribution. Then the
distribution r(y) that minimizes the relative entropy D(p(x)p(y|x) || p(x)r(y))
is the marginal distribution r*(y) corresponding to p(y|x):

    D(p(x)p(y|x) || p(x)r*(y)) = \min_{r(y)} D(p(x)p(y|x) || p(x)r(y)),    (10.131)

where r*(y) = \sum_x p(x)p(y|x). Also,

    \max_{r(x|y)} \sum_{x,y} p(x)p(y|x) \log \frac{r(x|y)}{p(x)} = \sum_{x,y} p(x)p(y|x) \log \frac{r*(x|y)}{p(x)},    (10.132)

where

    r*(x|y) = \frac{p(x)p(y|x)}{\sum_x p(x)p(y|x)}.    (10.133)



Proof:

    D(p(x)p(y|x) || p(x)r(y)) - D(p(x)p(y|x) || p(x)r*(y))

      = \sum_{x,y} p(x)p(y|x) \log \frac{p(x)p(y|x)}{p(x)r(y)}    (10.134)

        - \sum_{x,y} p(x)p(y|x) \log \frac{p(x)p(y|x)}{p(x)r*(y)}    (10.135)

      = \sum_{x,y} p(x)p(y|x) \log \frac{r*(y)}{r(y)}    (10.136)

      = \sum_y r*(y) \log \frac{r*(y)}{r(y)}    (10.137)

      = D(r* || r)    (10.138)

      \ge 0.    (10.139)

The proof of the second part of the lemma is left as an exercise. □
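The first part of Lemma 10.8.1 can be checked numerically: for any r(y), the excess divergence over the marginal r*(y) should equal D(r* || r), as in (10.134)-(10.138). The following sketch does this for a small joint distribution; the function names and the example numbers are illustrative assumptions.

```python
import math

def product_divergence(p_x, p_y_given_x, r_y):
    """D(p(x)p(y|x) || p(x)r(y)) in bits; p_y_given_x[x][y] is the channel."""
    total = 0.0
    for i, px in enumerate(p_x):
        for j, pyx in enumerate(p_y_given_x[i]):
            if px > 0 and pyx > 0:
                total += px * pyx * math.log2(pyx / r_y[j])
    return total

def output_marginal(p_x, p_y_given_x):
    """r*(y) = sum_x p(x) p(y|x)."""
    n_y = len(p_y_given_x[0])
    return [sum(p_x[i] * p_y_given_x[i][j] for i in range(len(p_x)))
            for j in range(n_y)]

def kl_divergence(p, q):
    """D(p || q) in bits, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    p_x = [0.2, 0.8]
    channel = [[0.9, 0.1], [0.3, 0.7]]
    r_star = output_marginal(p_x, channel)
    r = [0.5, 0.5]  # an arbitrary competing output distribution
    gap = product_divergence(p_x, channel, r) - product_divergence(p_x, channel, r_star)
    print(gap, kl_divergence(r_star, r))  # the two numbers should agree
```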


We can use this lemma to rewrite the minimization in the definition of
the rate distortion function as a double minimization,

    R(D) = \min_{r(\hat{x})} \min_{q(\hat{x}|x) : \sum p(x) q(\hat{x}|x) d(x,\hat{x}) \le D} \sum_x \sum_{\hat{x}} p(x) q(\hat{x}|x) \log \frac{q(\hat{x}|x)}{r(\hat{x})}.    (10.140)

If A is the set of all joint distributions with marginal p(x) that satisfy the
distortion constraints and if B is the set of product distributions p(x)r(\hat{x})
with arbitrary r(\hat{x}), we can write

    R(D) = \min_{q \in B} \min_{p \in A} D(p || q).    (10.141)



We now apply the process of alternating minimization, which is called the
Blahut-Arimoto algorithm in this case. We begin with a choice of \lambda and
an initial output distribution r(\hat{x}) and calculate the q(\hat{x}|x) that minimizes
the mutual information subject to the distortion constraint. We can use the
method of Lagrange multipliers for this minimization to obtain

    q(\hat{x}|x) = \frac{r(\hat{x}) e^{-\lambda d(x,\hat{x})}}{\sum_{\hat{x}} r(\hat{x}) e^{-\lambda d(x,\hat{x})}}.    (10.142)

For this conditional distribution q(\hat{x}|x), we calculate the output distribution
r(\hat{x}) that minimizes the mutual information, which by Lemma 10.8.1
is

    r(\hat{x}) = \sum_x p(x) q(\hat{x}|x).    (10.143)

We use this output distribution as the starting point of the next iteration.
Each step in the iteration, minimizing over q(\cdot|\cdot) and then minimizing over
r(\cdot), reduces the right-hand side of (10.140). Thus, there is a limit, and
the limit has been shown to be R(D) by Csiszar [139], where the value
of D and R(D) depends on \lambda. Thus, choosing \lambda appropriately sweeps out
the R(D) curve.
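A minimal sketch of the iteration (10.142)-(10.143) is given below. It assumes a finite alphabet, a uniform initial r(\hat{x}), and a fixed number of iterations instead of a convergence test; the function name and argument layout are hypothetical. The multiplier \lambda enters through the natural exponential exactly as in (10.142), while the rate is reported in bits.

```python
import math

def blahut_arimoto_rd(p, d, lam, iterations=500):
    """Alternating minimization for one point on the R(D) curve.

    p   : source distribution p(x), a list
    d   : distortion matrix, d[x][xhat]
    lam : Lagrange multiplier lambda > 0 (larger lambda gives smaller D)
    Returns (rate_in_bits, distortion) for the final test channel q(xhat|x).
    """
    n_x, n_hat = len(p), len(d[0])
    r = [1.0 / n_hat] * n_hat          # initial output distribution r(xhat)
    q = []
    for _ in range(iterations):
        # Step (10.142): best test channel for the current r(xhat).
        q = []
        for i in range(n_x):
            weights = [r[j] * math.exp(-lam * d[i][j]) for j in range(n_hat)]
            total = sum(weights)
            q.append([w / total for w in weights])
        # Step (10.143): best output distribution is the marginal of p(x)q(xhat|x).
        r = [sum(p[i] * q[i][j] for i in range(n_x)) for j in range(n_hat)]
    distortion = sum(p[i] * q[i][j] * d[i][j]
                     for i in range(n_x) for j in range(n_hat))
    rate = sum(p[i] * q[i][j] * math.log2(q[i][j] / r[j])
               for i in range(n_x) for j in range(n_hat)
               if p[i] > 0 and q[i][j] > 0)
    return rate, distortion

if __name__ == "__main__":
    hamming = [[0, 1], [1, 0]]
    for lam in (0.5, 1.0, 2.0, 4.0):
        print(lam, blahut_arimoto_rd([0.3, 0.7], hamming, lam))
```

For a Bernoulli source with Hamming distortion the output can be compared with the closed form R(D) = H(p) - H(D); sweeping \lambda over a grid traces out the curve point by point.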

A similar procedure can be applied to the calculation of channel capacity.
Again we rewrite the definition of channel capacity,

    C = \max_{r(x)} I(X;Y) = \max_{r(x)} \sum_x \sum_y r(x) p(y|x) \log \frac{r(x) p(y|x)}{r(x) \sum_{x'} r(x') p(y|x')},    (10.144)

as a double maximization using Lemma 10.8.1,

    C = \max_{q(x|y)} \max_{r(x)} \sum_x \sum_y r(x) p(y|x) \log \frac{q(x|y)}{r(x)}.    (10.145)



In this case, the Csiszar-Tusnady algorithm becomes one of alternating
maximization: we start with a guess of the maximizing distribution r(x)
and find the best conditional distribution, which is, by Lemma 10.8.1,

    q(x|y) = \frac{r(x) p(y|x)}{\sum_x r(x) p(y|x)}.    (10.146)

For this conditional distribution, we find the best input distribution
r(x) by solving the constrained maximization problem with Lagrange
multipliers. The optimum input distribution is

    r(x) = \frac{\prod_y (q(x|y))^{p(y|x)}}{\sum_x \prod_y (q(x|y))^{p(y|x)}},    (10.147)

which we can use as the basis for the next iteration.
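The alternating maximization (10.146)-(10.147) can be sketched in the same way. The code below assumes a discrete memoryless channel given as a row-stochastic matrix and a strictly positive starting distribution r(x); the function name, the optional starting point, and the fixed iteration count are assumptions for this example.

```python
import math

def blahut_arimoto_capacity(channel, r=None, iterations=500):
    """Alternating maximization for channel capacity.

    channel : matrix with channel[x][y] = p(y|x)
    r       : optional strictly positive initial input distribution r(x)
    Returns (capacity_in_bits, input_distribution).
    """
    n_x, n_y = len(channel), len(channel[0])
    if r is None:
        r = [1.0 / n_x] * n_x
    for _ in range(iterations):
        out = [sum(r[x] * channel[x][y] for x in range(n_x)) for y in range(n_y)]
        # log of prod_y q(x|y)^{p(y|x)}, with q(x|y) from (10.146)
        log_w = []
        for x in range(n_x):
            s = 0.0
            for y in range(n_y):
                if channel[x][y] > 0:
                    q_xy = r[x] * channel[x][y] / out[y]
                    s += channel[x][y] * math.log(q_xy)
            log_w.append(s)
        # Normalize as in (10.147); subtract the max for numerical stability.
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]
        total = sum(weights)
        r = [w / total for w in weights]
    out = [sum(r[x] * channel[x][y] for x in range(n_x)) for y in range(n_y)]
    capacity = sum(r[x] * channel[x][y] * math.log2(channel[x][y] / out[y])
                   for x in range(n_x) for y in range(n_y)
                   if r[x] > 0 and channel[x][y] > 0)
    return capacity, r

if __name__ == "__main__":
    bsc = [[0.9, 0.1], [0.1, 0.9]]
    print(blahut_arimoto_capacity(bsc, r=[0.2, 0.8]))
```

For a binary symmetric channel with crossover probability 0.1 the result can be compared against 1 - H(0.1), and for a binary erasure channel with erasure probability \alpha against 1 - \alpha.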

These algorithms for the computation of the channel capacity and the 
rate distortion function were established by Blahut [65] and Arimoto [25] 
and the convergence for the rate distortion computation was proved by 
Csiszar [139]. The alternating minimization procedure of Csiszar and
Tusnady can be specialized to many other situations as well, including the EM
algorithm [166], and the algorithm for finding the log-optimal portfolio 
for a stock market [123]. 


SUMMARY 

Rate distortion. The rate distortion function for a source X ~ p(x)
and distortion measure d(x, \hat{x}) is

    R(D) = \min_{p(\hat{x}|x) : \sum_{(x,\hat{x})} p(x) p(\hat{x}|x) d(x,\hat{x}) \le D} I(X;\hat{X}),    (10.148)

where the minimization is over all conditional distributions p(\hat{x}|x) for
which the joint distribution p(x, \hat{x}) = p(x)p(\hat{x}|x) satisfies the expected
distortion constraint.

Rate distortion theorem. If R > R(D), there exists a sequence of
codes \hat{X}^n(X^n) with the number of codewords |\hat{X}^n(\cdot)| \le 2^{nR} with
Ed(X^n, \hat{X}^n(X^n)) \to D. If R < R(D), no such codes exist.

Bernoulli source. For a Bernoulli source with Hamming distortion, 

R(D) = H(p) - H(D). (10.149) 

Gaussian source. For a Gaussian source with squared-error distortion,

    R(D) = \frac{1}{2} \log \frac{\sigma^2}{D}.    (10.150)

Source - channel separation. A source with rate distortion R(D) can 
be sent over a channel of capacity C and recovered with distortion D 
if and only if R(D) < C. 

Multivariate Gaussian source. The rate distortion function for a multivariate
normal vector with Euclidean mean-squared-error distortion is
given by reverse water-filling on the eigenvalues.
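The closed forms summarized above are simple to evaluate. The sketch below implements (10.149) and (10.150), using the standard convention that R(D) = 0 once D exceeds min(p, 1 - p) or \sigma^2, and a bisection-based reverse water-filling for independent Gaussian components (per-component distortion D_i = min(level, \sigma_i^2) with the D_i summing to D). The function names and the bisection approach are choices made for this illustration; rates are in bits and D is assumed to be strictly positive.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_rate_distortion(p, D):
    """R(D) = H(p) - H(D) for 0 <= D <= min(p, 1-p), and 0 beyond that."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

def gaussian_rate_distortion(variance, D):
    """R(D) = (1/2) log2(variance / D) for 0 < D <= variance, and 0 beyond that."""
    if D >= variance:
        return 0.0
    return 0.5 * math.log2(variance / D)

def reverse_water_filling(variances, D):
    """Rate and per-component distortions for independent Gaussian components.

    `variances` plays the role of the eigenvalues; the water level is found
    by bisection so that sum_i min(level, variance_i) = D.
    """
    if D >= sum(variances):
        return 0.0, list(variances)
    lo, hi = 0.0, max(variances)
    for _ in range(200):
        level = 0.5 * (lo + hi)
        if sum(min(level, v) for v in variances) < D:
            lo = level
        else:
            hi = level
    level = 0.5 * (lo + hi)
    distortions = [min(level, v) for v in variances]
    rate = sum(0.5 * math.log2(v / di) for v, di in zip(variances, distortions))
    return rate, distortions

if __name__ == "__main__":
    print(bernoulli_rate_distortion(0.5, 0.11))
    print(gaussian_rate_distortion(1.0, 0.25))
    print(reverse_water_filling([4.0, 1.0], 3.0))
```

In the last call the water level rises above the smaller variance, so that component is described with zero bits and only the larger one contributes to the rate.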


PROBLEMS 

10.1 One-bit quantization of a single Gaussian random variable. Let
X ~ N(0, \sigma^2) and let the distortion measure be squared error.
Here we do not allow block descriptions. Show that the optimum
reproduction points for 1-bit quantization are \pm \sqrt{2/\pi} \sigma and that the
expected distortion for 1-bit quantization is \frac{\pi - 2}{\pi} \sigma^2. Compare this
with the distortion rate bound D = \sigma^2 2^{-2R} for R = 1.

10.2 Rate distortion function with infinite distortion. Find the rate distortion
function R(D) = \min I(X;\hat{X}) for X ~ Bernoulli(1/2) and
distortion
10.3 Rate distortion for binary source with asymmetric distortion.
Fix p(\hat{x}|x) and evaluate I(X;\hat{X}) and D for

    X ~ Bernoulli(1/2),

    d(x, \hat{x}) = [ 0  a ]
                    [ b  0 ].

(The rate distortion function cannot be expressed in closed form.) 

10.4 Properties of R(D). Consider a discrete source X \in \mathcal{X} =
{1, 2, ..., m} with distribution p_1, p_2, ..., p_m and a distortion
measure d(i, j). Let R(D) be the rate distortion function for
this source and distortion measure. Let d'(i, j) = d(i, j) - w_i be
a new distortion measure, and let R'(D) be the corresponding
rate distortion function. Show that R'(D) = R(D + \bar{w}), where
\bar{w} = \sum p_i w_i, and use this to show that there is no essential loss of
generality in assuming that \min_{\hat{x}} d(i, \hat{x}) = 0 (i.e., for each x \in \mathcal{X},
there is one symbol \hat{x} that reproduces the source with zero distortion).
This result is due to Pinkston [420].

10.5 Rate distortion for uniform source with Hamming distortion.
Consider a source X uniformly distributed on the set {1, 2, ..., m}.
Find the rate distortion function for this source with Hamming
distortion; that is,

    d(x, \hat{x}) = 0 if x = \hat{x},
                    1 if x \ne \hat{x}.

10.6 Shannon lower bound for the rate distortion function. Consider
a source X with a distortion measure d(x, \hat{x}) that satisfies the
following property: All columns of the distortion matrix are permutations
of the set {d_1, d_2, ..., d_m}. Define the function

    \phi(D) = \max_{p : \sum_{i=1}^m p_i d_i \le D} H(p).    (10.151)


The Shannon lower bound on the rate distortion function [485] 
is proved by the following steps: 

(a) Show that \phi(D) is a concave function of D.

(b) Justify the following series of inequalities for I(X;\hat{X}) if
Ed(X, \hat{X}) \le D,

    I(X;\hat{X}) = H(X) - H(X|\hat{X})    (10.152)

                 = H(X) - \sum_{\hat{x}} p(\hat{x}) H(X|\hat{X} = \hat{x})    (10.153)

                 \ge H(X) - \sum_{\hat{x}} p(\hat{x}) \phi(D_{\hat{x}})    (10.154)

                 \ge H(X) - \phi(\sum_{\hat{x}} p(\hat{x}) D_{\hat{x}})    (10.155)

                 \ge H(X) - \phi(D),    (10.156)

where D_{\hat{x}} = \sum_x p(x|\hat{x}) d(x, \hat{x}).

(c) Argue that

    R(D) \ge H(X) - \phi(D),    (10.157)

which is the Shannon lower bound on the rate distortion function.

(d) If, in addition, we assume that the source has a uniform
distribution and that the rows of the distortion matrix are permutations
of each other, then R(D) = H(X) - \phi(D) (i.e., the
lower bound is tight).

10.7 Erasure distortion. Consider X ~ Bernoulli(1/2), and let the distortion
measure be given by the matrix

    d(x, \hat{x}) = [ 0       1  \infty ]
                    [ \infty  1  0      ].    (10.158)

Calculate the rate distortion function for this source. Can you 
suggest a simple scheme to achieve any value of the rate distortion 
function for this source? 

10.8 Bounds on the rate distortion function for squared-error distortion.
For the case of a continuous random variable X with mean zero
and variance \sigma^2 and squared-error distortion, show that

    h(X) - \frac{1}{2} \log(2\pi e D) \le R(D) \le \frac{1}{2} \log \frac{\sigma^2}{D}.    (10.159)

For the upper bound, consider the following joint distribution:

Are Gaussian random variables harder or easier to describe than
other random variables with the same variance?

10.9 Properties of optimal rate distortion code. A good (R, D) rate
distortion code with R \approx R(D) puts severe constraints on the relationship
of the source X^n and the representations \hat{X}^n. Examine the
chain of inequalities (10.58-10.71) considering the conditions for
equality and interpret as properties of a good code. For example,
equality in (10.59) implies that \hat{X}^n is a deterministic function
of X^n.

10.10 Rate distortion. Find and verify the rate distortion function R(D)
for X uniform on \mathcal{X} = {1, 2, ..., 2m} and

    d(x, \hat{x}) = 1 for x - \hat{x} odd,
                    0 for x - \hat{x} even,

where \hat{X} is defined on \hat{\mathcal{X}} = {1, 2, ..., 2m}. (You may wish to use
the Shannon lower bound in your argument.)

10.11 Lower bound. Let

    X ~ \frac{e^{-x^4}}{\int e^{-x^4} dx}

and

    \frac{\int x^4 e^{-x^4} dx}{\int e^{-x^4} dx} = c.

Define g(a) = \max h(X) over all densities such that EX^4 \le a.
Let R(D) be the rate distortion function for X with the density
above and with distortion criterion d(x, \hat{x}) = (x - \hat{x})^4. Show that
R(D) \ge g(c) - g(D).
10.12 Adding a column to the distortion matrix. Let R(D) be the rate
distortion function for an i.i.d. process with probability mass function
p(x) and distortion function d(x, \hat{x}), x \in \mathcal{X}, \hat{x} \in \hat{\mathcal{X}}. Now
suppose that we add a new reproduction symbol \hat{x}_0 to \hat{\mathcal{X}} with associated
distortion d(x, \hat{x}_0), x \in \mathcal{X}. Does this increase or decrease
R(D), and why?

10.13 Simplification. Suppose that \mathcal{X} = {1, 2, 3, 4}, \hat{\mathcal{X}} = {1, 2, 3, 4},
p(i) = 1/4, i = 1, 2, 3, 4, and X_1, X_2, ... are i.i.d. ~ p(x). The
distortion matrix d(x, \hat{x}) is given by

          1   2   3   4
      1   0   0   1   1
      2   0   0   1   1
      3   1   1   0   0
      4   1   1   0   0

(a) Find R(0), the rate necessary to describe the process with
zero distortion.

(b) Find the rate distortion function R(D). There are some irrelevant
distinctions in alphabets \mathcal{X} and \hat{\mathcal{X}}, which allow the
problem to be collapsed.

(c) Suppose that we have a nonuniform distribution p(i) = p_i,
i = 1, 2, 3, 4. What is R(D)?

10.14 Rate distortion for two independent sources. Can one compress
two independent sources simultaneously better than by compressing
the sources individually? The following problem addresses this
question. Let {X_i} be i.i.d. ~ p(x) with distortion d(x, \hat{x}) and rate
distortion function R_X(D). Similarly, let {Y_i} be i.i.d. ~ p(y) with
distortion d(y, \hat{y}) and rate distortion function R_Y(D). Suppose we
now wish to describe the process {(X_i, Y_i)} subject to distortions
Ed(X, \hat{X}) \le D_1 and Ed(Y, \hat{Y}) \le D_2. Thus, a rate R_{X,Y}(D_1, D_2)
is sufficient, where

    R_{X,Y}(D_1, D_2) = \min_{p(\hat{x},\hat{y}|x,y) : Ed(X,\hat{X}) \le D_1, Ed(Y,\hat{Y}) \le D_2} I(X, Y; \hat{X}, \hat{Y}).

Now suppose that the {X_i} process and the {Y_i} process are independent
of each other.

(a) Show that

    R_{X,Y}(D_1, D_2) \ge R_X(D_1) + R_Y(D_2).

(b) Does equality hold?

Now answer the question.

10.15 Distortion rate function. Let

    D(R) = \min_{p(\hat{x}|x) : I(X;\hat{X}) \le R} Ed(X, \hat{X})    (10.160)

be the distortion rate function.

(a) Is D(R) increasing or decreasing in R?

(b) Is D(R) convex or concave in R?

(c) Converse for distortion rate functions: We now wish to prove
the converse by focusing on D(R). Let X_1, X_2, ..., X_n be
i.i.d. ~ p(x). Suppose that one is given a (2^{nR}, n) rate distortion
code X^n \to i(X^n) \to \hat{X}^n(i(X^n)), with i(X^n) \in 2^{nR},
and suppose that the resulting distortion is D = Ed(X^n, \hat{X}^n(i(X^n))).
We must show that D \ge D(R). Give reasons for the
following steps in the proof:

    D = Ed(X^n, \hat{X}^n(i(X^n)))    (10.161)

      = E \frac{1}{n} \sum_{i=1}^n d(X_i, \hat{X}_i)    (10.162)

      = \frac{1}{n} \sum_{i=1}^n Ed(X_i, \hat{X}_i)    (10.163)

      \ge \frac{1}{n} \sum_{i=1}^n D(I(X_i; \hat{X}_i))    (10.164)

      \ge D(\frac{1}{n} \sum_{i=1}^n I(X_i; \hat{X}_i))    (10.165)

      \ge D(\frac{1}{n} I(X^n; \hat{X}^n))    (10.166)

      \ge D(R).    (10.167)


10.16 Probability of conditionally typical sequences. In Chapter 7 we
calculated the probability that two independently drawn sequences
X^n and Y^n are weakly jointly typical. To prove the rate distortion
theorem, however, we need to calculate this probability when
one of the sequences is fixed and the other is random. The techniques
of weak typicality allow us only to calculate the average
set size of the conditionally typical set. Using the ideas of strong
typicality, on the other hand, provides us with stronger bounds
that work for all typical x^n sequences. We outline the proof that
\Pr{(x^n, Y^n) \in A_\epsilon^{*(n)}} \approx 2^{-nI(X;Y)} for all typical x^n. This approach
was introduced by Berger [53] and is fully developed in the book
by Csiszar and Korner [149].

Let (X_i, Y_i) be drawn i.i.d. ~ p(x, y). Let the marginals of X and
Y be p(x) and p(y), respectively.

(a) Let A_\epsilon^{*(n)} be the strongly typical set for X. Show that

    |A_\epsilon^{*(n)}| \doteq 2^{nH(X)}.    (10.168)

(Hint: Theorems 11.1.1 and 11.1.3.)

(b) The joint type of a pair of sequences (x^n, y^n) is the proportion
of times (x_i, y_i) = (a, b) in the pair of sequences:

    p_{x^n,y^n}(a, b) = \frac{1}{n} N(a, b|x^n, y^n) = \frac{1}{n} \sum_{i=1}^n I(x_i = a, y_i = b).    (10.169)

The conditional type of a sequence y^n given x^n is a stochastic
matrix that gives the proportion of times a particular element
of \mathcal{Y} occurred with each element of \mathcal{X} in the pair of sequences.
Specifically, the conditional type V_{y^n|x^n}(b|a) is defined as

    V_{y^n|x^n}(b|a) = \frac{N(a, b|x^n, y^n)}{N(a|x^n)}.    (10.170)

Show that the number of conditional types is bounded by (n + 1)^{|\mathcal{X}||\mathcal{Y}|}.

(c) The set of sequences y^n \in \mathcal{Y}^n with conditional type V with
respect to a sequence x^n is called the conditional type class
T_V(x^n). Show that

    \frac{1}{(n + 1)^{|\mathcal{X}||\mathcal{Y}|}} 2^{nH(Y|X)} \le |T_V(x^n)| \le 2^{nH(Y|X)}.    (10.171)


(d) The sequence y^n \in \mathcal{Y}^n is said to be \epsilon-strongly conditionally
typical with the sequence x^n with respect to the conditional
distribution V(\cdot|\cdot) if the conditional type is close to V. The
conditional type should satisfy the following two conditions:

(i) For all (a, b) \in \mathcal{X} \times \mathcal{Y} with V(b|a) > 0,

    \frac{1}{n} |N(a, b|x^n, y^n) - V(b|a) N(a|x^n)| \le \frac{\epsilon}{|\mathcal{Y}| + 1}.    (10.172)

(ii) N(a, b|x^n, y^n) = 0 for all (a, b) such that V(b|a) = 0.

The set of such sequences is called the conditionally typical
set and is denoted A_\epsilon^{*(n)}(Y|x^n). Show that the number
of sequences y^n that are conditionally typical with a given
x^n \in \mathcal{X}^n is bounded by

    \frac{1}{(n + 1)^{|\mathcal{X}||\mathcal{Y}|}} 2^{n(H(Y|X) - \epsilon_1)} \le |A_\epsilon^{*(n)}(Y|x^n)|
        \le (n + 1)^{|\mathcal{X}||\mathcal{Y}|} 2^{n(H(Y|X) + \epsilon_1)},    (10.173)

where \epsilon_1 \to 0 as \epsilon \to 0.

(e) For a pair of random variables (X, Y) with joint distribution
p(x, y), the \epsilon-strongly typical set is the set of sequences
(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n satisfying

(i)

    |\frac{1}{n} N(a, b|x^n, y^n) - p(a, b)| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|}    (10.174)

for every pair (a, b) \in \mathcal{X} \times \mathcal{Y} with p(a, b) > 0.

(ii) N(a, b|x^n, y^n) = 0 for all (a, b) \in \mathcal{X} \times \mathcal{Y} with
p(a, b) = 0.

The set of \epsilon-strongly jointly typical sequences is called the
\epsilon-strongly jointly typical set and is denoted A_\epsilon^{*(n)}(X, Y). Let
(X, Y) be drawn i.i.d. ~ p(x, y). For any x^n such that there
exists at least one pair (x^n, y^n) \in A_\epsilon^{*(n)}(X, Y), the set of sequences
y^n such that (x^n, y^n) \in A_\epsilon^{*(n)} satisfies

    \frac{1}{(n + 1)^{|\mathcal{X}||\mathcal{Y}|}} 2^{n(H(Y|X) - \delta(\epsilon))} \le |{y^n : (x^n, y^n) \in A_\epsilon^{*(n)}}|
        \le (n + 1)^{|\mathcal{X}||\mathcal{Y}|} 2^{n(H(Y|X) + \delta(\epsilon))},    (10.175)

where \delta(\epsilon) \to 0 as \epsilon \to 0. In particular, we can write

    2^{n(H(Y|X) - \epsilon_2)} \le |{y^n : (x^n, y^n) \in A_\epsilon^{*(n)}}| \le 2^{n(H(Y|X) + \epsilon_2)},    (10.176)

where we can make \epsilon_2 arbitrarily small with an appropriate
choice of \epsilon and n.

(f) Let Y_1, Y_2, ..., Y_n be drawn i.i.d. ~ \prod p(y_i). For x^n \in A_\epsilon^{*(n)},
the probability that (x^n, Y^n) \in A_\epsilon^{*(n)} is bounded by

    2^{-n(I(X;Y) + \epsilon_3)} \le \Pr((x^n, Y^n) \in A_\epsilon^{*(n)}) \le 2^{-n(I(X;Y) - \epsilon_3)},    (10.177)

where \epsilon_3 goes to 0 as \epsilon \to 0 and n \to \infty.


10.17 Source-channel separation theorem with distortion. Let V_1,
V_2, ..., V_n be a finite alphabet i.i.d. source which is encoded
as a sequence of n input symbols X^n of a discrete memoryless
channel. The output of the channel Y^n is mapped onto the reconstruction
alphabet \hat{V}^n = g(Y^n). Let D = Ed(V^n, \hat{V}^n) = \frac{1}{n} \sum_{i=1}^n
Ed(V_i, \hat{V}_i) be the average distortion achieved by this combined
source and channel coding scheme.

    V^n → X^n(V^n) → Channel Capacity C → Y^n → \hat{V}^n

(a) Show that if C > R(D), where R(D) is the rate distortion
function for V, it is possible to find encoders and decoders
that achieve an average distortion arbitrarily close to D.

(b) (Converse) Show that if the average distortion is equal to D,
the capacity of the channel C must be greater than R(D).

10.18 Rate distortion. Let d(x, \hat{x}) be a distortion function. We have
a source X ~ p(x). Let R(D) be the associated rate distortion
function.

(a) Find \tilde{R}(D) in terms of R(D), where \tilde{R}(D) is the rate
distortion function associated with the distortion \tilde{d}(x, \hat{x}) =
d(x, \hat{x}) + a for some constant a > 0. (They are not equal.)

(b) Now suppose that d(x, \hat{x}) \ge 0 for all x, \hat{x} and define a new
distortion function d*(x, \hat{x}) = b d(x, \hat{x}), where b is some number
\ge 0. Find the associated rate distortion function R*(D) in
terms of R(D).

(c) Let X ~ N(0, \sigma^2) and d(x, \hat{x}) = 5(x - \hat{x})^2 + 3. What is
R(D)?
10.19 Rate distortion with two constraints. Let X_i be i.i.d. ~ p(x). We
are given two distortion functions, d_1(x, \hat{x}) and d_2(x, \hat{x}). We
wish to describe X^n at rate R and reconstruct it with distortions
Ed_1(X^n, \hat{X}_1^n) \le D_1 and Ed_2(X^n, \hat{X}_2^n) \le D_2, as shown here:

    X^n → i(X^n) → (\hat{X}_1^n(i), \hat{X}_2^n(i))

    D_1 = Ed_1(X_1^n, \hat{X}_1^n)
    D_2 = Ed_2(X_1^n, \hat{X}_2^n).

Here i(\cdot) takes on 2^{nR} values. What is the rate distortion function
R(D_1, D_2)?

10.20 Rate distortion. Consider the standard rate distortion problem,
X_i i.i.d. ~ p(x), X^n → i(X^n) → \hat{X}^n, |i(\cdot)| = 2^{nR}. Consider two
distortion criteria d_1(x, \hat{x}) and d_2(x, \hat{x}). Suppose that d_1(x, \hat{x}) \le
d_2(x, \hat{x}) for all x \in \mathcal{X}, \hat{x} \in \hat{\mathcal{X}}. Let R_1(D) and R_2(D) be the corresponding
rate distortion functions.

(a) Find the inequality relationship between R_1(D) and R_2(D).

(b) Suppose that we must describe the source {X_i} at the minimum
rate R achieving d_1(X^n, \hat{X}_1^n) \le D and d_2(X^n, \hat{X}_2^n) \le D.
Thus,

    X^n → i(X^n) → { \hat{X}_1^n(i(X^n)), \hat{X}_2^n(i(X^n)) }

and |i(\cdot)| = 2^{nR}.

Find the minimum rate R.


HISTORICAL NOTES 

The idea of rate distortion was introduced by Shannon in his original 
paper [472]. He returned to it and dealt with it exhaustively in his 1959 
paper [485], which proved the first rate distortion theorem. Meanwhile, 
Kolmogorov and his school in the Soviet Union began to develop rate 
distortion theory in 1956. Stronger versions of the rate distortion theorem 
have been proved for more general sources in the comprehensive book by 
Berger [52]. 

The inverse water-filling solution for the rate distortion function for
parallel Gaussian sources was established by McDonald and Schultheiss
[381]. An iterative algorithm for the calculation of the rate distortion
function for a general i.i.d. source and arbitrary distortion measure was
described by Blahut [65], Arimoto [25], and Csiszar [139]. This algorithm
is a special case of a general alternating minimization algorithm due to
Csiszar and Tusnady [155].


CHAPTER 11 

INFORMATION THEORY 
AND STATISTICS 


We now explore the relationship between information theory and statistics. 
We begin by describing the method of types, which is a powerful technique 
in large deviation theory. We use the method of types to calculate the 
probability of rare events and to show the existence of universal source 
codes. We also consider the problem of testing hypotheses and derive the 
best possible error exponents for such tests (the Chernoff-Stein lemma). 
Finally, we treat the estimation of the parameters of a distribution and 
describe the role of Fisher information. 


11.1 METHOD OF TYPES 

The AEP for discrete random variables (Chapter 3) focuses our attention 
on a small subset of typical sequences. The method of types is an even 
more powerful procedure in which we consider sequences that have the 
same empirical distribution. With this restriction, we can derive strong 
bounds on the number of sequences with a particular empirical distribution 
and the probability of each sequence in this set. It is then possible to derive 
strong error bounds for the channel coding theorem and prove a variety 
of rate distortion results. The method of types was fully developed by 
Csiszar and Korner [149], who obtained most of their results from this 
point of view. 

Let X_1, X_2, ..., X_n be a sequence of n symbols from an alphabet \mathcal{X} =
{a_1, a_2, ..., a_{|\mathcal{X}|}}. We use the notation x^n and x interchangeably to denote
a sequence x_1, x_2, ..., x_n.

Definition The type P_x (or empirical probability distribution) of a sequence
x_1, x_2, ..., x_n is the relative proportion of occurrences of each
symbol of \mathcal{X} (i.e., P_x(a) = N(a|x)/n for all a \in \mathcal{X}, where N(a|x) is the
number of times the symbol a occurs in the sequence x \in \mathcal{X}^n).

The type of a sequence x is denoted as P x . It is a probability mass 
function on X. (Note that in this chapter, we will use capital letters to 
denote types and distributions. We also loosely use the word distribution 
to mean a probability mass function.) 
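The definition of a type translates directly into a few lines of code. The sketch below counts symbol occurrences and returns exact fractions; the function name and the representation of the alphabet as a string of single-character symbols are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction

def type_of(sequence, alphabet):
    """Empirical distribution P_x(a) = N(a|x)/n for every symbol a in the alphabet."""
    counts = Counter(sequence)
    n = len(sequence)
    return {a: Fraction(counts[a], n) for a in alphabet}

if __name__ == "__main__":
    # A length-5 sequence over the ternary alphabet {1, 2, 3}.
    print(type_of("11321", "123"))
```

Using `Fraction` keeps the denominators visible, which makes it easy to see that every entry of a type of a length-n sequence is a multiple of 1/n.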


Definition The probability simplex in R^m is the set of points x =
(x_1, x_2, ..., x_m) \in R^m such that x_i \ge 0, \sum_{i=1}^m x_i = 1.

The probability simplex is an (m - 1)-dimensional manifold in
m-dimensional space. When m = 3, the probability simplex is the
set of points {(x_1, x_2, x_3) : x_1 \ge 0, x_2 \ge 0, x_3 \ge 0, x_1 + x_2 + x_3 = 1}
(Figure 11.1). Since this is a triangular two-dimensional flat in R^3, we
use a triangle to represent the probability simplex in later sections of this
chapter.


Definition Let \mathcal{P}_n denote the set of types with denominator n.

For example, if \mathcal{X} = {0, 1}, the set of possible types with denominator
n is

    \mathcal{P}_n = { (P(0), P(1)) : (0/n, n/n), (1/n, (n-1)/n), ..., (n/n, 0/n) }.    (11.1)


Definition If P \in \mathcal{P}_n, the set of sequences of length n and type P is
called the type class of P, denoted T(P):

    T(P) = {x \in \mathcal{X}^n : P_x = P}.    (11.2)


The type class is sometimes called the composition class of P. 



FIGURE 11.1. Probability simplex in R^3.
Example 11.1.1 Let \mathcal{X} = {1, 2, 3}, a ternary alphabet. Let x = 11321.
Then the type P_x is

    P_x(1) = 3/5,  P_x(2) = 1/5,  P_x(3) = 1/5.    (11.3)

The type class of P_x is the set of all sequences of length 5 with three 1's,
one 2, and one 3. There are 20 such sequences, and

    T(P_x) = {11123, 11132, 11213, ..., 32111}.    (11.4)

The number of elements in T(P) is

    |T(P)| = \binom{5}{3, 1, 1} = \frac{5!}{3! 1! 1!} = 20.    (11.5)
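The count in (11.5) can be confirmed by brute force. The sketch below enumerates all sequences over the alphabet and keeps those with the required symbol counts, then compares the result with the multinomial coefficient; the function names and the dictionary representation of the counts are assumptions for this illustration, and the enumeration is only practical for short sequences.

```python
from itertools import product
from math import factorial

def type_class(counts):
    """All sequences whose symbol counts equal `counts` (dict: symbol -> count).

    Brute force over |alphabet|^n sequences, in lexicographic order.
    """
    n = sum(counts.values())
    symbols = sorted(counts)
    return ["".join(seq) for seq in product(symbols, repeat=n)
            if all(seq.count(a) == counts[a] for a in symbols)]

def multinomial(counts):
    """n! / (c_1! c_2! ... c_k!) for the given counts."""
    size = factorial(sum(counts.values()))
    for c in counts.values():
        size //= factorial(c)
    return size

if __name__ == "__main__":
    counts = {"1": 3, "2": 1, "3": 1}   # three 1's, one 2, one 3
    members = type_class(counts)
    print(len(members), multinomial(counts))
    print(members[:3], members[-1])
```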


The essential power of the method of types arises from the following
theorem, which shows that the number of types is at most polynomial
in n.

Theorem 11.1.1

    |\mathcal{P}_n| \le (n + 1)^{|\mathcal{X}|}.    (11.6)

Proof: There are |\mathcal{X}| components in the vector that specifies P_x. The
numerator in each component can take on only n + 1 values. So there are
at most (n + 1)^{|\mathcal{X}|} choices for the type vector. Of course, these choices
are not independent (e.g., the last choice is fixed by the others). But this
is a sufficiently good upper bound for our needs. □
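To see how loose or tight the bound of Theorem 11.1.1 is, one can compare it with the exact number of types. The exact count used below, \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1}, is the standard stars-and-bars count of nonnegative integer count vectors summing to n; it is brought in here for comparison and is not stated in the text. The function names are illustrative.

```python
from math import comb

def number_of_types(n, alphabet_size):
    """Exact number of types with denominator n (stars-and-bars count)."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)

def polynomial_bound(n, alphabet_size):
    """Upper bound (n + 1)^{|X|} from Theorem 11.1.1."""
    return (n + 1) ** alphabet_size

if __name__ == "__main__":
    for n in (5, 10, 100):
        exact = number_of_types(n, 3)
        bound = polynomial_bound(n, 3)
        sequences = 3 ** n
        print(n, exact, bound, sequences)
```

Both the exact count and the bound grow polynomially in n, while the number of sequences grows exponentially, which is the point the theorem is used for.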

The crucial point here is that there are only a polynomial number of 
types of length n. Since the number of sequences is exponential in n, it 
follows that at least one type has exponentially many sequences in its 
type class. In fact, the largest type class has essentially the same number 
of elements as the entire set of sequences, to first order in the exponent. 

Now, we assume that the sequence X_1, X_2, ..., X_n is drawn i.i.d.
according to a distribution Q(x). All sequences with the same type have
the same probability, as shown in the following theorem. Let Q^n(x^n) =
\prod_{i=1}^n Q(x_i) denote the product distribution associated with Q.

Theorem 11.1.2 If X_1, X_2, ..., X_n are drawn i.i.d. according to Q(x),
the probability of x depends only on its type and is given by

    Q^n(x) = 2^{-n(H(P_x) + D(P_x || Q))}.    (11.7)
(11.7) 
Proof:

    Q^n(x) = \prod_{i=1}^n Q(x_i)    (11.8)

           = \prod_{a \in \mathcal{X}} Q(a)^{N(a|x)}    (11.9)

           = \prod_{a \in \mathcal{X}} Q(a)^{n P_x(a)}    (11.10)

           = \prod_{a \in \mathcal{X}} 2^{n P_x(a) \log Q(a)}    (11.11)

           = \prod_{a \in \mathcal{X}} 2^{n (P_x(a) \log Q(a) - P_x(a) \log P_x(a) + P_x(a) \log P_x(a))}    (11.12)

           = 2^{n \sum_{a \in \mathcal{X}} (-P_x(a) \log \frac{P_x(a)}{Q(a)} + P_x(a) \log P_x(a))}    (11.13)

           = 2^{n(-D(P_x || Q) - H(P_x))}.    (11.14)


Corollary If x is in the type class of Q, then

    Q^n(x) = 2^{-nH(Q)}.    (11.15)

Proof: If x \in T(Q), then P_x = Q, which can be substituted into (11.14). □
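Theorem 11.1.2 and its corollary are easy to confirm numerically by computing the probability of a sequence in two ways: as a direct product of symbol probabilities, and from the entropy of its type plus the divergence to Q. The function names and the example distribution below are illustrative assumptions; Q is assumed to be positive on every symbol that occurs in the sequence.

```python
import math
from collections import Counter

def direct_probability(x, Q):
    """Q^n(x) as the product of Q(x_i) over the sequence."""
    prob = 1.0
    for symbol in x:
        prob *= Q[symbol]
    return prob

def probability_from_type(x, Q):
    """2^{-n(H(P_x) + D(P_x || Q))}, with logarithms to base 2."""
    n = len(x)
    P = {a: c / n for a, c in Counter(x).items()}
    entropy = -sum(pa * math.log2(pa) for pa in P.values())
    divergence = sum(pa * math.log2(pa / Q[a]) for a, pa in P.items())
    return 2.0 ** (-n * (entropy + divergence))

if __name__ == "__main__":
    Q = {"1": 0.5, "2": 0.3, "3": 0.2}
    x = "11321"
    print(direct_probability(x, Q), probability_from_type(x, Q))
```

When Q is chosen equal to the type of x, the divergence term vanishes and both computations reduce to 2^{-nH(Q)}, as in the corollary.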

Example 11.1.2 The probability that a fair die produces a particular
sequence of length n with precisely n/6 occurrences of each face (n is a
multiple of 6) is 2^{-nH(1/6, 1/6, ..., 1/6)} = 6^{-n}. This is obvious. However, if the
die has a probability mass function (1/3, 1/3, 1/6, 1/12, 1/12, 0), the probability of
observing a particular sequence with precisely these frequencies is precisely
2^{-nH(1/3, 1/3, 1/6, 1/12, 1/12, 0)} for n a multiple of 12. This is more interesting.

We now give an estimate of the size of a type class T(P). 

Theorem 11.1.3 (Size of a type class T(P)) For any type P \in \mathcal{P}_n,

    \frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}.    (11.16)

Proof: The exact size of T(P) is easy to calculate. It is a simple combinatorial
problem: the number of ways of arranging nP(a_1), nP(a_2), ...,
nP(a_{|\mathcal{X}|}) objects in a sequence, which is

    |T(P)| = \binom{n}{nP(a_1), nP(a_2), ..., nP(a_{|\mathcal{X}|})}.    (11.17)

This value is hard to manipulate, so we derive simple exponential bounds
on its value.

We suggest two alternative proofs for the exponential bounds. The first
proof uses Stirling's formula [208] to bound the factorial function, and
after some algebra, we can obtain the bounds of the theorem. We give an
alternative proof. We first prove the upper bound. Since a type class must
have probability \le 1, we have

    1 \ge P^n(T(P))    (11.18)

      = \sum_{x \in T(P)} P^n(x)    (11.19)

      = \sum_{x \in T(P)} 2^{-nH(P)}    (11.20)

      = |T(P)| 2^{-nH(P)},    (11.21)

using Theorem 11.1.2. Thus,

    |T(P)| \le 2^{nH(P)}.    (11.22)

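Now for the lower bound. We first prove that the type class T(P)
has the highest probability among all type classes under the probability
distribution P:

    P^n(T(P)) \ge P^n(T(\hat{P}))  for all \hat{P} \in \mathcal{P}_n.    (11.23)

We lower bound the ratio of probabilities,

    \frac{P^n(T(P))}{P^n(T(\hat{P}))} = \frac{|T(P)| \prod_{a \in \mathcal{X}} P(a)^{nP(a)}}{|T(\hat{P})| \prod_{a \in \mathcal{X}} P(a)^{n\hat{P}(a)}}    (11.24)

      = \frac{\binom{n}{nP(a_1), nP(a_2), ..., nP(a_{|\mathcal{X}|})} \prod_{a \in \mathcal{X}} P(a)^{nP(a)}}{\binom{n}{n\hat{P}(a_1), n\hat{P}(a_2), ..., n\hat{P}(a_{|\mathcal{X}|})} \prod_{a \in \mathcal{X}} P(a)^{n\hat{P}(a)}}    (11.25)

      = \prod_{a \in \mathcal{X}} \frac{(n\hat{P}(a))!}{(nP(a))!} P(a)^{n(P(a) - \hat{P}(a))}.    (11.26)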
Now using the simple bound (easy to prove by separately considering the
cases m \ge n and m < n)

    \frac{m!}{n!} \ge n^{m-n},    (11.27)

we obtain

    \frac{P^n(T(P))}{P^n(T(\hat{P}))} \ge \prod_{a \in \mathcal{X}} (nP(a))^{n\hat{P}(a) - nP(a)} P(a)^{n(P(a) - \hat{P}(a))}    (11.28)

      = \prod_{a \in \mathcal{X}} n^{n(\hat{P}(a) - P(a))}    (11.29)

      = n^{n(\sum_{a \in \mathcal{X}} \hat{P}(a) - \sum_{a \in \mathcal{X}} P(a))}    (11.30)

      = n^{n(1 - 1)}    (11.31)

      = 1.    (11.32)

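Hence, $P^n(T(P)) \ge P^n(T(\hat{P}))$. The lower bound now follows easily from this result, since

$$\begin{aligned}
1 &= \sum_{Q \in \mathcal{P}_n} P^n(T(Q)) && \text{(11.33)}\\
&\le \sum_{Q \in \mathcal{P}_n} \max_{Q} P^n(T(Q)) && \text{(11.34)}\\
&= \sum_{Q \in \mathcal{P}_n} P^n(T(P)) && \text{(11.35)}\\
&\le (n+1)^{|\mathcal{X}|}\, P^n(T(P)) && \text{(11.36)}\\
&= (n+1)^{|\mathcal{X}|} \sum_{x \in T(P)} P^n(x) && \text{(11.37)}\\
&= (n+1)^{|\mathcal{X}|} \sum_{x \in T(P)} 2^{-nH(P)} && \text{(11.38)}\\
&= (n+1)^{|\mathcal{X}|}\, |T(P)|\, 2^{-nH(P)}, && \text{(11.39)}
\end{aligned}$$

where (11.36) follows from Theorem 11.1.1 and (11.38) follows from Theorem 11.1.2. □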
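The two bounds on the size of a type class can be checked numerically for small cases. The following is a minimal sketch, not part of the original text: the function names are illustrative assumptions, and a type is represented by its vector of symbol counts $nP(a)$.

```python
from math import factorial, log2

def entropy(p):
    """Entropy in bits of a probability vector p."""
    return -sum(q * log2(q) for q in p if q > 0)

def type_class_size(counts):
    """Multinomial coefficient n! / prod_a (nP(a))!, as in (11.17)."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)  # each partial quotient is an exact integer
    return size

def size_bounds(counts):
    """Return (lower, log2 |T(P)|, upper), all in bits."""
    n, k = sum(counts), len(counts)
    h = entropy([c / n for c in counts])
    upper = n * h                      # upper bound (11.22)
    lower = upper - k * log2(n + 1)    # lower bound implied by (11.39)
    return lower, log2(type_class_size(counts)), upper

lower, exact, upper = size_bounds([6, 3, 1])
```

For the type with counts (6, 3, 1) on a ternary alphabet, the exact exponent sits between the two bounds, and the gap between them is only the polynomial factor $|\mathcal{X}| \log(n+1)$.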


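We give a slightly better approximation for the binary case.

Example 11.1.3 (Binary alphabet) In this case, the type is defined by the number of 1's in the sequence, and the size of the type class is therefore $\binom{n}{k}$. We show that

$$\frac{1}{n+1}\, 2^{nH\left(\frac{k}{n}\right)} \le \binom{n}{k} \le 2^{nH\left(\frac{k}{n}\right)}. \tag{11.40}$$

These bounds can be proved using Stirling's approximation for the factorial function (Lemma 17.5.1). But we provide a more intuitive proof below.

We first prove the upper bound. From the binomial formula, for any $p$,

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1. \tag{11.41}$$

Since all the terms of the sum are positive for $0 < p < 1$, each of the terms is less than 1. Setting $p = k/n$ and taking the $k$th term, we get

$$\begin{aligned}
1 &\ge \binom{n}{k} \left(\frac{k}{n}\right)^{k} \left(1 - \frac{k}{n}\right)^{n-k} && \text{(11.42)}\\
&= \binom{n}{k}\, 2^{k \log\frac{k}{n} + (n-k)\log\frac{n-k}{n}} && \text{(11.43)}\\
&= \binom{n}{k}\, 2^{n\left(\frac{k}{n}\log\frac{k}{n} + \frac{n-k}{n}\log\frac{n-k}{n}\right)} && \text{(11.44)}\\
&= \binom{n}{k}\, 2^{-nH\left(\frac{k}{n}\right)}. && \text{(11.45)}
\end{aligned}$$

Hence,

$$\binom{n}{k} \le 2^{nH\left(\frac{k}{n}\right)}. \tag{11.46}$$

For the lower bound, let $S$ be a random variable with a binomial distribution with parameters $n$ and $p$. The most likely value of $S$ is $S = \langle np \rangle$. This can easily be verified from the fact that

$$\frac{P(S = i+1)}{P(S = i)} = \frac{n-i}{i+1} \cdot \frac{p}{1-p} \tag{11.47}$$

and considering the cases when $i < np$ and when $i > np$. Then, since there are $n+1$ terms in the binomial sum,

$$\begin{aligned}
1 = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}
&\le (n+1) \max_{k} \binom{n}{k} p^k (1-p)^{n-k} && \text{(11.48)}\\
&= (n+1) \binom{n}{\langle np \rangle} p^{\langle np \rangle} (1-p)^{n - \langle np \rangle}. && \text{(11.49)}
\end{aligned}$$

Now let $p = k/n$. Then we have

$$1 \le (n+1) \binom{n}{k} \left(\frac{k}{n}\right)^{k} \left(1 - \frac{k}{n}\right)^{n-k}, \tag{11.50}$$

which by the arguments in (11.45) is equivalent to

$$\frac{1}{n+1} \le \binom{n}{k}\, 2^{-nH\left(\frac{k}{n}\right)}, \tag{11.51}$$

or

$$\binom{n}{k} \ge \frac{2^{nH\left(\frac{k}{n}\right)}}{n+1}. \tag{11.52}$$

Combining the two results, we see that

$$\binom{n}{k} \doteq 2^{nH\left(\frac{k}{n}\right)}. \tag{11.53}$$

A more precise bound can be found in Theorem 17.5.1 when $k \ne 0$ or $n$.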

Theorem 11.1.4 (Probability of type class) For any $P \in \mathcal{P}_n$ and any distribution $Q$, the probability of the type class $T(P)$ under $Q^n$ is $2^{-nD(P\|Q)}$ to first order in the exponent. More precisely,

$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}. \tag{11.54}$$

Proof: We have

$$\begin{aligned}
Q^n(T(P)) &= \sum_{x \in T(P)} Q^n(x) && \text{(11.55)}\\
&= \sum_{x \in T(P)} 2^{-n(D(P\|Q) + H(P))} && \text{(11.56)}\\
&= |T(P)|\, 2^{-n(D(P\|Q) + H(P))}, && \text{(11.57)}
\end{aligned}$$

by Theorem 11.1.2. Using the bounds on $|T(P)|$ derived in Theorem 11.1.3, we have

$$\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}. \quad \Box \tag{11.58}$$
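As an illustration of (11.54), the exact probability of a type class can be computed from (11.57) and compared with the two exponential bounds. This is a sketch under stated assumptions: the helper names are hypothetical, the type is given by its symbol counts, and $Q$ is assumed strictly positive on the alphabet.

```python
from math import factorial, log2

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits; assumes q(a) > 0 wherever p(a) > 0."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def log2_type_class_prob(counts, q):
    """Exact log2 of Q^n(T(P)) = |T(P)| * prod_a Q(a)^(nP(a))."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return log2(size) + sum(c * log2(qa) for c, qa in zip(counts, q) if c > 0)

def sandwich(counts, q):
    """Return (lower, exact, upper) exponents of (11.54), in bits."""
    n = sum(counts)
    d = kl_divergence([c / n for c in counts], q)
    upper = -n * d
    lower = upper - len(counts) * log2(n + 1)
    return lower, log2_type_class_prob(counts, q), upper

lower, exact, upper = sandwich([6, 3, 1], [0.2, 0.5, 0.3])
```

Working with base-2 logarithms keeps the comparison stable for long sequences, where the probabilities themselves would underflow.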

We can summarize the basic theorems concerning types in four equations:

$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}, \tag{11.59}$$

$$Q^n(x) = 2^{-n(D(P_x\|Q) + H(P_x))}, \tag{11.60}$$

$$|T(P)| \doteq 2^{nH(P)}, \tag{11.61}$$

$$Q^n(T(P)) \doteq 2^{-nD(P\|Q)}. \tag{11.62}$$


These equations state that there are only a polynomial number of types 
and that there are an exponential number of sequences of each type. We 
also have an exact formula for the probability of any sequence of type P 
under distribution Q and an approximate formula for the probability of a 
type class. 

These equations allow us to calculate the behavior of long sequences 
based on the properties of the type of the sequence. For example, for 
long sequences drawn i.i.d. according to some distribution, the type of 
the sequence is close to the distribution generating the sequence, and we 
can use the properties of this distribution to estimate the properties of the 
sequence. Some of the applications that will be dealt with in the next few 
sections are as follows: 

• The law of large numbers 

• Universal source coding 

• Sanov’s theorem 

• The Chernoff-Stein lemma and hypothesis testing

• Conditional probability and limit theorems 


11.2 LAW OF LARGE NUMBERS 

The concept of type and type classes enables us to give an alternative statement of the law of large numbers. In fact, it can be used as a proof of a version of the weak law in the discrete case. The most important property of types is that there are only a polynomial number of types, and an exponential number of sequences of each type. Since the probability of each type class depends exponentially on the relative entropy distance between the type $P$ and the distribution $Q$, type classes that are far from the true distribution have exponentially smaller probability.

Given an $\epsilon > 0$, we can define a typical set $T_Q^{\epsilon}$ of sequences for the distribution $Q^n$ as

$$T_Q^{\epsilon} = \{x^n : D(P_{x^n}\|Q) \le \epsilon\}. \tag{11.63}$$

Then the probability that $x^n$ is not typical is

$$\begin{aligned}
1 - Q^n(T_Q^{\epsilon}) &= \sum_{P : D(P\|Q) > \epsilon} Q^n(T(P)) && \text{(11.64)}\\
&\le \sum_{P : D(P\|Q) > \epsilon} 2^{-nD(P\|Q)} \quad \text{(Theorem 11.1.4)} && \text{(11.65)}\\
&\le \sum_{P : D(P\|Q) > \epsilon} 2^{-n\epsilon} && \text{(11.66)}\\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-n\epsilon} \quad \text{(Theorem 11.1.1)} && \text{(11.67)}\\
&= 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}, && \text{(11.68)}
\end{aligned}$$

which goes to 0 as $n \to \infty$. Hence, the probability of the typical set $T_Q^{\epsilon}$ goes to 1 as $n \to \infty$. This is similar to the AEP proved in Chapter 3, which is a form of the weak law of large numbers. We now prove that the empirical distribution $P_{X^n}$ converges to $P$.

Theorem 11.2.1 Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim P(x)$. Then

$$\Pr\{D(P_{x^n}\|P) > \epsilon\} \le 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}, \tag{11.69}$$

and consequently, $D(P_{x^n}\|P) \to 0$ with probability 1.

Proof: The inequality (11.69) was proved in (11.68). Summing over $n$, we find that

$$\sum_{n=1}^{\infty} \Pr\{D(P_{x^n}\|P) > \epsilon\} < \infty. \tag{11.70}$$

Thus, the expected number of occurrences of the event $\{D(P_{x^n}\|P) > \epsilon\}$ for all $n$ is finite, which implies that the actual number of such occurrences is also finite with probability 1 (Borel-Cantelli lemma). Hence $D(P_{x^n}\|P) \to 0$ with probability 1. □

We now define a stronger version of typicality than in Chapter 3. 

Definition We define the strongly typical set $A_{\epsilon}^{*(n)}$ to be the set of sequences in $\mathcal{X}^n$ for which the sample frequencies are close to the true values:

$$A_{\epsilon}^{*(n)} = \left\{ x \in \mathcal{X}^n :
\begin{array}{ll}
\left| \frac{1}{n} N(a|x) - P(a) \right| < \frac{\epsilon}{|\mathcal{X}|}, & \text{if } P(a) > 0\\[4pt]
N(a|x) = 0, & \text{if } P(a) = 0
\end{array}
\right\}. \tag{11.71}$$

Hence, the typical set consists of sequences whose type does not differ from the true probabilities by more than $\epsilon/|\mathcal{X}|$ in any component. By the strong law of large numbers, it follows that the probability of the strongly typical set goes to 1 as $n \to \infty$. The additional power afforded by strong typicality is useful in proving stronger results, particularly in universal coding, rate distortion theory, and large deviation theory.

11.3 UNIVERSAL SOURCE CODING 

Huffman coding compresses an i.i.d. source with a known distribution 
p(x) to its entropy limit H(X). However, if the code is designed for 
some incorrect distribution q(x), a penalty of D(p\\q) is incurred. Thus, 
Huffman coding is sensitive to the assumed distribution. 

What compression can be achieved if the true distribution $p(x)$ is unknown? Is there a universal code of rate $R$, say, that suffices to describe every i.i.d. source with entropy $H(X) < R$? The surprising answer is yes. The idea is based on the method of types. There are $2^{nH(P)}$ sequences of type $P$. Since there are only a polynomial number of types with denominator $n$, an enumeration of all sequences $x^n$ with type $P_{x^n}$ such that $H(P_{x^n}) < R$ will require roughly $nR$ bits. Thus, by describing all such sequences, we are prepared to describe any sequence that is likely to arise from any distribution $Q$ having entropy $H(Q) < R$. We begin with a definition.

Definition A fixed-rate block code of rate $R$ for a source $X_1, X_2, \ldots, X_n$ which has an unknown distribution $Q$ consists of two mappings: the encoder,

$$f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}, \tag{11.72}$$

and the decoder,

$$\phi_n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n. \tag{11.73}$$

Here $R$ is called the rate of the code. The probability of error for the code with respect to the distribution $Q$ is

$$P_e^{(n)} = Q^n\left(X^n : \phi_n(f_n(X^n)) \ne X^n\right). \tag{11.74}$$

Definition A rate $R$ block code for a source will be called universal if the functions $f_n$ and $\phi_n$ do not depend on the distribution $Q$ and if $P_e^{(n)} \to 0$ as $n \to \infty$ if $R > H(Q)$.

We now describe one such universal encoding scheme, due to Csiszár and Körner [149], that is based on the fact that the number of sequences of type $P$ increases exponentially with the entropy and the fact that there are only a polynomial number of types.

Theorem 11.3.1 There exists a sequence of $(2^{nR}, n)$ universal source codes such that $P_e^{(n)} \to 0$ for every source $Q$ such that $H(Q) < R$.

Proof: Fix the rate $R$ for the code. Let

$$R_n = R - |\mathcal{X}| \frac{\log(n+1)}{n}. \tag{11.75}$$

Consider the set of sequences

$$A = \{x \in \mathcal{X}^n : H(P_x) \le R_n\}. \tag{11.76}$$

Then

$$\begin{aligned}
|A| &= \sum_{P \in \mathcal{P}_n : H(P) \le R_n} |T(P)| && \text{(11.77)}\\
&\le \sum_{P \in \mathcal{P}_n : H(P) \le R_n} 2^{nH(P)} && \text{(11.78)}\\
&\le \sum_{P \in \mathcal{P}_n : H(P) \le R_n} 2^{nR_n} && \text{(11.79)}\\
&\le (n+1)^{|\mathcal{X}|}\, 2^{nR_n} && \text{(11.80)}\\
&= 2^{n\left(R_n + |\mathcal{X}| \frac{\log(n+1)}{n}\right)} && \text{(11.81)}\\
&= 2^{nR}. && \text{(11.82)}
\end{aligned}$$

By indexing the elements of $A$, we define the encoding function $f_n$ as

$$f_n(x) = \begin{cases} \text{index of } x \text{ in } A & \text{if } x \in A,\\ 0 & \text{otherwise.} \end{cases} \tag{11.83}$$

The decoding function maps each index onto the corresponding element of $A$. Hence all the elements of $A$ are recovered correctly, and all the remaining sequences result in an error. The set of sequences that are recovered correctly is illustrated in Figure 11.2.

We now show that this encoding scheme is universal. Assume that the distribution of $X_1, X_2, \ldots, X_n$ is $Q$ and $H(Q) < R$. Then the probability of decoding error is given by

$$\begin{aligned}
P_e^{(n)} &= 1 - Q^n(A) && \text{(11.84)}\\
&= \sum_{P : H(P) > R_n} Q^n(T(P)) && \text{(11.85)}\\
&\le (n+1)^{|\mathcal{X}|} \max_{P : H(P) > R_n} Q^n(T(P)) && \text{(11.86)}\\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-n \min_{P : H(P) > R_n} D(P\|Q)}. && \text{(11.87)}
\end{aligned}$$

Since $R_n \uparrow R$ and $H(Q) < R$, there exists $n_0$ such that for all $n \ge n_0$, $R_n > H(Q)$. Then for $n \ge n_0$, $\min_{P : H(P) > R_n} D(P\|Q)$ must be greater than 0, and the probability of error $P_e^{(n)}$ converges to 0 exponentially fast as $n \to \infty$.



FIGURE 11.2. Universal code and the probability simplex. Each sequence with type that lies outside the circle is encoded by its index. There are fewer than $2^{nR}$ such sequences. Sequences with types within the circle are encoded by 0.

FIGURE 11.3. Error exponent for the universal code (error exponent plotted against the rate of the code, with $H(Q)$ marked on the rate axis).

On the other hand, if the distribution $Q$ is such that the entropy $H(Q)$ is greater than the rate $R$, then with high probability the sequence will have a type outside the set $A$. Hence, in such cases the probability of error is close to 1.

The exponent in the probability of error is

$$D^*_{R,Q} = \min_{P : H(P) \ge R} D(P\|Q), \tag{11.88}$$

which is illustrated in Figure 11.3. □
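The construction in the proof can be sketched for a binary alphabet, where a type is determined by the number of ones $k$. This is an illustrative sketch rather than an implementation from the text: the function names, the rate 0.9, and the Bernoulli(0.1) source are assumptions chosen for the example.

```python
from math import comb, log2

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) distribution."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def codebook_types(n, rate):
    """Numbers of ones k whose type k/n has entropy at most R_n, as in (11.75)-(11.76)."""
    r_n = rate - 2 * log2(n + 1) / n  # |X| = 2 for a binary alphabet
    return [k for k in range(n + 1) if binary_entropy(k / n) <= r_n]

def codebook_size(n, rate):
    """|A|: total number of sequences whose type is in the codebook."""
    return sum(comb(n, k) for k in codebook_types(n, rate))

def error_probability(n, rate, q):
    """Probability under i.i.d. Bernoulli(q) that the sequence lies outside A."""
    inside = set(codebook_types(n, rate))
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(n + 1) if k not in inside)

sizes_ok = all(codebook_size(n, 0.9) <= 2 ** (n * 0.9) for n in (20, 100, 200))
pe_short = error_probability(20, 0.9, 0.1)
pe_long = error_probability(200, 0.9, 0.1)
```

The encoder never looks at $q$; only the error probability depends on it. For short blocks the penalty term $|\mathcal{X}| \log(n+1)/n$ is large and the code performs poorly, which matches the remark that universal codes need longer block lengths.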

The universal coding scheme described here is only one of many such 
schemes. It is universal over the set of i.i.d. distributions. There are other 
schemes, such as the Lempel-Ziv algorithm, which is a variable-rate uni¬ 
versal code for all ergodic sources. The Lempel-Ziv algorithm, discussed 
in Section 13.4, is often used in practice to compress data that cannot be 
modeled simply, such as English text or computer source code. 

One may wonder why it is ever necessary to use Huffman codes, which 
are specific to a probability distribution. What do we lose in using a 
universal code? Universal codes need a longer block length to obtain 
the same performance as a code designed specifically for the probability 
distribution. We pay the penalty for this increase in block length by the 
increased complexity of the encoder and decoder. Hence, a distribution 
specific code is best if one knows the distribution of the source. 

11.4 LARGE DEVIATION THEORY 

The subject of large deviation theory can be illustrated by an example. What is the probability that $\frac{1}{n}\sum X_i$ is near $\frac{1}{3}$ if $X_1, X_2, \ldots, X_n$ are drawn i.i.d. Bernoulli($\frac{1}{3}$)? This is a small deviation (from the expected outcome), and the probability is near 1. Now what is the probability that $\frac{1}{n}\sum X_i$ is greater than $\frac{3}{4}$ given that $X_1, X_2, \ldots, X_n$ are Bernoulli($\frac{1}{3}$)? This is a large deviation, and the probability is exponentially small. We might estimate the exponent using the central limit theorem, but this is a poor approximation for more than a few standard deviations. We note that $\frac{1}{n}\sum X_i = \frac{3}{4}$ is equivalent to $P_x = (\frac{3}{4}, \frac{1}{4})$. Thus, the probability that $\bar{X}_n$ is near $\frac{3}{4}$ is the probability that type $P_x$ is near $(\frac{3}{4}, \frac{1}{4})$. The probability of this large deviation will turn out to be $\approx 2^{-nD\left((\frac{3}{4}, \frac{1}{4}) \| (\frac{1}{3}, \frac{2}{3})\right)}$. In this section we estimate the probability of a set of nontypical types.

Let $E$ be a subset of the set of probability mass functions. For example, $E$ may be the set of probability mass functions with mean $\mu$. With a slight abuse of notation, we write

$$Q^n(E) = Q^n(E \cap \mathcal{P}_n) = \sum_{x : P_x \in E \cap \mathcal{P}_n} Q^n(x). \tag{11.89}$$

If $E$ contains a relative entropy neighborhood of $Q$, then by the weak law of large numbers (Theorem 11.2.1), $Q^n(E) \to 1$. On the other hand, if $E$ does not contain $Q$ or a neighborhood of $Q$, then by the weak law of large numbers, $Q^n(E) \to 0$ exponentially fast. We will use the method of types to calculate the exponent.

Let us first give some examples of the kinds of sets $E$ that we are considering. For example, assume that by observation we find that the sample average of $g(X)$ is greater than or equal to $\alpha$ [i.e., $\frac{1}{n}\sum_i g(x_i) \ge \alpha$]. This event is equivalent to the event $P_X \in E \cap \mathcal{P}_n$, where

$$E = \left\{ P : \sum_{a \in \mathcal{X}} g(a) P(a) \ge \alpha \right\}, \tag{11.90}$$

because

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} g(x_i) \ge \alpha &\iff \sum_{a \in \mathcal{X}} P_x(a) g(a) \ge \alpha && \text{(11.91)}\\
&\iff P_x \in E \cap \mathcal{P}_n. && \text{(11.92)}
\end{aligned}$$

Thus,

$$\Pr\left( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \ge \alpha \right) = Q^n(E \cap \mathcal{P}_n) = Q^n(E). \tag{11.93}$$

Here $E$ is a half space in the space of probability vectors, as illustrated in Figure 11.4.


Theorem 11.4.1 (Sanov's theorem) Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q(x)$. Let $E \subseteq \mathcal{P}$ be a set of probability distributions. Then

$$Q^n(E) = Q^n(E \cap \mathcal{P}_n) \le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^*\|Q)}, \tag{11.94}$$

where

$$P^* = \arg\min_{P \in E} D(P\|Q) \tag{11.95}$$

is the distribution in $E$ that is closest to $Q$ in relative entropy. If, in addition, the set $E$ is the closure of its interior, then

$$\frac{1}{n} \log Q^n(E) \to -D(P^*\|Q). \tag{11.96}$$

Proof: We first prove the upper bound:

$$\begin{aligned}
Q^n(E) &= \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) && \text{(11.97)}\\
&\le \sum_{P \in E \cap \mathcal{P}_n} 2^{-nD(P\|Q)} && \text{(11.98)}\\
&\le \sum_{P \in E \cap \mathcal{P}_n} \max_{P \in E \cap \mathcal{P}_n} 2^{-nD(P\|Q)} && \text{(11.99)}\\
&= \sum_{P \in E \cap \mathcal{P}_n} 2^{-n \min_{P \in E \cap \mathcal{P}_n} D(P\|Q)} && \text{(11.100)}\\
&\le \sum_{P \in E \cap \mathcal{P}_n} 2^{-n \min_{P \in E} D(P\|Q)} && \text{(11.101)}\\
&= \sum_{P \in E \cap \mathcal{P}_n} 2^{-nD(P^*\|Q)} && \text{(11.102)}\\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^*\|Q)}, && \text{(11.103)}
\end{aligned}$$

where the last inequality follows from Theorem 11.1.1. Note that $P^*$ need not be a member of $\mathcal{P}_n$. We now come to the lower bound, for which we need a "nice" set $E$, so that for all large $n$, we can find a distribution in $E \cap \mathcal{P}_n$ that is close to $P^*$. If we now assume that $E$ is the closure of its interior (thus, the interior must be nonempty), then since $\bigcup_n \mathcal{P}_n$ is dense in the set of all distributions, it follows that $E \cap \mathcal{P}_n$ is nonempty for all $n \ge n_0$ for some $n_0$. We can then find a sequence of distributions $P_n$ such that $P_n \in E \cap \mathcal{P}_n$ and $D(P_n\|Q) \to D(P^*\|Q)$. For each $n \ge n_0$,

$$\begin{aligned}
Q^n(E) &= \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) && \text{(11.104)}\\
&\ge Q^n(T(P_n)) && \text{(11.105)}\\
&\ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P_n\|Q)}. && \text{(11.106)}
\end{aligned}$$

Consequently,

$$\liminf \frac{1}{n} \log Q^n(E) \ge \liminf \left( -\frac{|\mathcal{X}| \log(n+1)}{n} - D(P_n\|Q) \right) = -D(P^*\|Q). \tag{11.107}$$

Combining this with the upper bound establishes the theorem. □

This argument can be extended to continuous distributions using quantization.

11.5 EXAMPLES OF SANOV'S THEOREM

Suppose that we wish to find $\Pr\left\{ \frac{1}{n} \sum_{i=1}^{n} g_j(X_i) \ge \alpha_j,\ j = 1, 2, \ldots, k \right\}$. Then the set $E$ is defined as

$$E = \left\{ P : \sum_{a} P(a) g_j(a) \ge \alpha_j,\ j = 1, 2, \ldots, k \right\}. \tag{11.108}$$

To find the closest distribution in $E$ to $Q$, we minimize $D(P\|Q)$ subject to the constraints in (11.108). Using Lagrange multipliers, we construct the functional

$$J(P) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{i} \lambda_i \sum_{x} P(x) g_i(x) + \nu \sum_{x} P(x). \tag{11.109}$$

We then differentiate and calculate the closest distribution to $Q$ to be of the form

$$P^*(x) = \frac{Q(x)\, e^{\sum_i \lambda_i g_i(x)}}{\sum_{a \in \mathcal{X}} Q(a)\, e^{\sum_i \lambda_i g_i(a)}}, \tag{11.110}$$

where the constants $\lambda_i$ are chosen to satisfy the constraints. Note that if $Q$ is uniform, $P^*$ is the maximum entropy distribution. Verification that $P^*$ is indeed the minimum follows from the same kinds of arguments as given in Chapter 12.

Let us consider some specific examples: 

Example 11.5.1 (Dice) Suppose that we toss a fair die $n$ times; what is the probability that the average of the throws is greater than or equal to 4? From Sanov's theorem, it follows that

$$Q^n(E) \doteq 2^{-nD(P^*\|Q)}, \tag{11.111}$$

where $P^*$ minimizes $D(P\|Q)$ over all distributions $P$ that satisfy

$$\sum_{i=1}^{6} i P(i) \ge 4. \tag{11.112}$$

From (11.110), it follows that $P^*$ has the form

$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^{6} 2^{\lambda i}}, \tag{11.113}$$

with $\lambda$ chosen so that $\sum_i i P^*(i) = 4$. Solving numerically, we obtain $\lambda = 0.2519$, $P^* = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468)$, and therefore $D(P^*\|Q) = 0.0624$ bit. Thus, the probability that the average of 10000 throws is greater than or equal to 4 is $\approx 2^{-624}$.
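The numerical solution quoted in this example can be reproduced with a few lines of code. The sketch below assumes bisection as the root-finding method (the text does not say how $\lambda$ was obtained) and uses illustrative function names; it relies on the mean of the tilted distribution (11.113) being increasing in $\lambda$.

```python
from math import log2

FACES = range(1, 7)

def tilted(lam):
    """Distribution P*(x) proportional to 2^(lam * x) on the faces 1..6, as in (11.113)."""
    weights = [2.0 ** (lam * x) for x in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p):
    return sum(x * px for x, px in zip(FACES, p))

def solve_lambda(target, lo=0.0, hi=5.0, iters=100):
    """Bisection on lam so that the tilted distribution has the target mean."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(4.0)
p_star = tilted(lam)
# D(P*||Q) in bits for the uniform Q(x) = 1/6
divergence = sum(p * log2(6 * p) for p in p_star)
```

Multiplying `divergence` by the number of throws gives the exponent in (11.111); for 10000 throws this is about 624 bits, in line with the figure quoted in the example.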

Example 11.5.2 (Coins) Suppose that we have a fair coin and want to estimate the probability of observing more than 700 heads in a series of 1000 tosses. The problem is like Example 11.5.1. The probability is

$$P(\bar{X}_n \ge 0.7) \doteq 2^{-nD(P^*\|Q)}, \tag{11.114}$$

where $P^*$ is the (0.7, 0.3) distribution and $Q$ is the (0.5, 0.5) distribution. In this case, $D(P^*\|Q) = 1 - H(P^*) = 1 - H(0.7) = 0.119$. Thus, the probability of 700 or more heads in 1000 trials is approximately $2^{-119}$.
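Because the binomial tail can be summed exactly with integer arithmetic, this estimate can be compared with the true probability. The following sketch (variable names are illustrative) computes both exponents in bits for 700 or more heads in 1000 fair tosses.

```python
from math import comb, log2

def binary_kl(p, q):
    """D((p, 1-p) || (q, 1-q)) in bits, for 0 < p < 1 and 0 < q < 1."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

n, k = 1000, 700

# Number of length-n binary sequences with at least k heads (exact integer).
tail_count = sum(comb(n, j) for j in range(k, n + 1))

# Each sequence has probability 2^-n under a fair coin.
exact_exponent = n - log2(tail_count)        # -log2 Pr(at least k heads)
sanov_exponent = n * binary_kl(k / n, 0.5)   # n * D(P*||Q)
```

The exact exponent is slightly larger than $nD(P^*\|Q)$: the two agree to first order in the exponent, and the difference is the polynomial factor that the $\doteq$ notation suppresses.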

Example 11.5.3 (Mutual dependence) Let $Q(x, y)$ be a given joint distribution and let $Q_0(x, y) = Q(x)Q(y)$ be the associated product distribution formed from the marginals of $Q$. We wish to know the likelihood that a sample drawn according to $Q_0$ will "appear" to be jointly distributed according to $Q$. Accordingly, let $(X_i, Y_i)$ be i.i.d. $\sim Q_0(x, y) = Q(x)Q(y)$. We define joint typicality as we did in Section 7.6; that is, $(x^n, y^n)$ is jointly typical with respect to a joint distribution $Q(x, y)$ iff the sample entropies are close to their true values:

$$\left| -\frac{1}{n} \log Q(x^n) - H(X) \right| \le \epsilon, \tag{11.115}$$

$$\left| -\frac{1}{n} \log Q(y^n) - H(Y) \right| \le \epsilon, \tag{11.116}$$

and

$$\left| -\frac{1}{n} \log Q(x^n, y^n) - H(X, Y) \right| \le \epsilon. \tag{11.117}$$

We wish to calculate the probability (under the product distribution) of seeing a pair $(x^n, y^n)$ that looks jointly typical of $Q$ [i.e., $(x^n, y^n)$ satisfies (11.115)-(11.117)]. Thus, $(x^n, y^n)$ are jointly typical with respect to $Q(x, y)$ if $P_{x^n, y^n} \in E \cap \mathcal{P}_n(X, Y)$, where

$$\begin{aligned}
E = \Big\{ P(x, y) :\ & \Big| -\sum_{x,y} P(x, y) \log Q(x) - H(X) \Big| \le \epsilon,\\
& \Big| -\sum_{x,y} P(x, y) \log Q(y) - H(Y) \Big| \le \epsilon,\\
& \Big| -\sum_{x,y} P(x, y) \log Q(x, y) - H(X, Y) \Big| \le \epsilon \Big\}. && \text{(11.118)}
\end{aligned}$$

Using Sanov's theorem, the probability is

$$Q_0^n(E) \doteq 2^{-nD(P^*\|Q_0)}, \tag{11.119}$$

where $P^*$ is the distribution satisfying the constraints that is closest to $Q_0$ in relative entropy. In this case, as $\epsilon \to 0$, it can be verified (Problem 11.10) that $P^*$ is the joint distribution $Q$, and $Q_0$ is the product distribution, so that the probability is $2^{-nD(Q(x,y)\|Q(x)Q(y))} = 2^{-nI(X;Y)}$, which is the same as the result derived in Chapter 7 for the joint AEP.


In the next section we consider the empirical distribution of the sequence 
of outcomes given that the type is in a particular set of distributions E. We 
will show that not only is the probability of the set E essentially determined 
by D(P*\\Q), the distance of the closest element of E to Q, but also that 
the conditional type is essentially P*, so that given that we are in set E, the 
type is very likely to be close to P*. 


11.6 CONDITIONAL LIMIT THEOREM 

It has been shown that the probability of a set of types under a distribution $Q$ is determined essentially by the probability of the closest element of the set to $Q$; the probability is $2^{-nD^*}$ to first order in the exponent, where

$$D^* = \min_{P \in E} D(P\|Q). \tag{11.120}$$

This follows because the probability of the set of types is the sum of the probabilities of each type, which is bounded by the largest term times the number of terms. Since the number of terms is polynomial in the length of the sequences, the sum is equal to the largest term to first order in the exponent.

We now strengthen the argument to show that not only is the probability of the set $E$ essentially the same as the probability of the closest type $P^*$ but also that the total probability of other types that are far away from $P^*$ is negligible. This implies that with very high probability, the type observed is close to $P^*$. We call this a conditional limit theorem.

Before we prove this result, we prove a "Pythagorean" theorem, which gives some insight into the geometry of $D(P\|Q)$. Since $D(P\|Q)$ is not a metric, many of the intuitive properties of distance are not valid for $D(P\|Q)$. The next theorem shows a sense in which $D(P\|Q)$ behaves like the square of the Euclidean metric (Figure 11.5).

Theorem 11.6.1 For a closed convex set $E \subset \mathcal{P}$ and distribution $Q \notin E$, let $P^* \in E$ be the distribution that achieves the minimum distance to $Q$; that is,

$$D(P^*\|Q) = \min_{P \in E} D(P\|Q). \tag{11.121}$$

Then

$$D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q) \tag{11.122}$$

for all $P \in E$.

Note. The main use of this theorem is as follows: Suppose that we have a sequence $P_n \in E$ that yields $D(P_n\|Q) \to D(P^*\|Q)$. Then from the Pythagorean theorem, $D(P_n\|P^*) \to 0$ as well.

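Proof: Consider any $P \in E$. Let

$$P_\lambda = \lambda P + (1 - \lambda) P^*. \tag{11.123}$$

Then $P_\lambda \to P^*$ as $\lambda \to 0$. Also, since $E$ is convex, $P_\lambda \in E$ for $0 \le \lambda \le 1$. Since $D(P^*\|Q)$ is the minimum of $D(P_\lambda\|Q)$ along the path $P^* \to P$, the derivative of $D(P_\lambda\|Q)$ as a function of $\lambda$ is nonnegative at $\lambda = 0$. Now

$$D_\lambda = D(P_\lambda\|Q) = \sum_{x} P_\lambda(x) \log \frac{P_\lambda(x)}{Q(x)} \tag{11.124}$$

and

$$\frac{dD_\lambda}{d\lambda} = \sum_{x} \left( (P(x) - P^*(x)) \log \frac{P_\lambda(x)}{Q(x)} + (P(x) - P^*(x)) \right). \tag{11.125}$$

Setting $\lambda = 0$, so that $P_\lambda = P^*$, and using the fact that $\sum_x P(x) = \sum_x P^*(x) = 1$, we have

$$\begin{aligned}
0 &\le \left( \frac{dD_\lambda}{d\lambda} \right)_{\lambda = 0} && \text{(11.126)}\\
&= \sum_{x} (P(x) - P^*(x)) \log \frac{P^*(x)}{Q(x)} && \text{(11.127)}\\
&= \sum_{x} P(x) \log \frac{P^*(x)}{Q(x)} - \sum_{x} P^*(x) \log \frac{P^*(x)}{Q(x)} && \text{(11.128)}\\
&= \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \cdot \frac{P^*(x)}{P(x)} \right) - \sum_{x} P^*(x) \log \frac{P^*(x)}{Q(x)} && \text{(11.129)}\\
&= D(P\|Q) - D(P\|P^*) - D(P^*\|Q), && \text{(11.130)}
\end{aligned}$$

which proves the theorem. □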
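The inequality (11.122) can be spot-checked numerically. The sketch below makes several assumptions that are not in the text: it takes $E$ to be the half space of distributions on $\{1, \ldots, 6\}$ with mean at least 4 (a closed convex set), takes $Q$ uniform, computes $P^*$ by bisection on the exponentially tilted family, and samples candidate $P$ at random with a fixed seed.

```python
import random
from math import log2

FACES = range(1, 7)
Q = [1 / 6] * 6

def kl(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def mean(p):
    return sum(x * px for x, px in zip(FACES, p))

def tilted(lam):
    weights = [2.0 ** (lam * x) for x in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def projection(target=4.0):
    """P*: the tilted distribution whose mean equals the target (bisection)."""
    lo, hi = 0.0, 5.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

def pythagorean_gaps(trials=2000, seed=0):
    """D(P||Q) - D(P||P*) - D(P*||Q) for random P in E; each should be >= 0."""
    rng = random.Random(seed)
    p_star = projection()
    gaps = []
    for _ in range(trials):
        w = [rng.expovariate(1.0) for _ in FACES]
        total = sum(w)
        p = [v / total for v in w]
        if mean(p) >= 4.0:  # keep only samples that lie in E
            gaps.append(kl(p, Q) - kl(p, p_star) - kl(p_star, Q))
    return gaps

gaps = pythagorean_gaps()
```

For a half-space constraint the gap works out to $\lambda$ times the amount by which the mean of $P$ exceeds 4, so it vanishes on the boundary of $E$ and grows as $P$ moves into the interior.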

Note that the relative entropy $D(P\|Q)$ behaves like the square of the Euclidean distance. Suppose that we have a convex set $E$ in $\mathcal{R}^n$. Let $A$ be a point outside the set, $B$ the point in the set closest to $A$, and $C$ any other point in the set. Then the angle between the lines $BA$ and $BC$ must be obtuse, which implies that $l_{AC}^2 \ge l_{AB}^2 + l_{BC}^2$, which is of the same form as Theorem 11.6.1. This is illustrated in Figure 11.6.

We now prove a useful lemma which shows that convergence in relative entropy implies convergence in the $\mathcal{L}_1$ norm.

Definition The $\mathcal{L}_1$ distance between any two distributions is defined as

$$\|P_1 - P_2\|_1 = \sum_{a \in \mathcal{X}} |P_1(a) - P_2(a)|. \tag{11.131}$$

Let $A$ be the set on which $P_1(x) > P_2(x)$. Then

$$\begin{aligned}
\|P_1 - P_2\|_1 &= \sum_{x \in \mathcal{X}} |P_1(x) - P_2(x)| && \text{(11.132)}\\
&= \sum_{x \in A} (P_1(x) - P_2(x)) + \sum_{x \in A^c} (P_2(x) - P_1(x)) && \text{(11.133)}\\
&= P_1(A) - P_2(A) + P_2(A^c) - P_1(A^c) && \text{(11.134)}\\
&= P_1(A) - P_2(A) + 1 - P_2(A) - 1 + P_1(A) && \text{(11.135)}\\
&= 2(P_1(A) - P_2(A)). && \text{(11.136)}
\end{aligned}$$

Also note that

$$\max_{B \subseteq \mathcal{X}} (P_1(B) - P_2(B)) = P_1(A) - P_2(A) = \frac{\|P_1 - P_2\|_1}{2}. \tag{11.137}$$

The left-hand side of (11.137) is called the variational distance between $P_1$ and $P_2$.
Lemma 11.6.1

$$D(P_1\|P_2) \ge \frac{1}{2 \ln 2} \|P_1 - P_2\|_1^2. \tag{11.138}$$

Proof: We first prove it for the binary case. Consider two binary distributions with parameters $p$ and $q$ with $p \ge q$. We will show that

$$p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} \ge \frac{4}{2 \ln 2} (p - q)^2. \tag{11.139}$$

The difference $g(p, q)$ between the two sides is

$$g(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - \frac{4}{2 \ln 2} (p - q)^2. \tag{11.140}$$

Then

$$\begin{aligned}
\frac{dg(p, q)}{dq} &= -\frac{p}{q \ln 2} + \frac{1 - p}{(1 - q) \ln 2} - \frac{4}{2 \ln 2}\, 2(q - p) && \text{(11.141)}\\
&= \frac{q - p}{q(1 - q) \ln 2} - \frac{4}{\ln 2} (q - p) && \text{(11.142)}\\
&\le 0 && \text{(11.143)}
\end{aligned}$$

since $q(1 - q) \le \frac{1}{4}$ and $q \le p$. For $q = p$, $g(p, q) = 0$, and hence $g(p, q) \ge 0$ for $q \le p$, which proves the lemma for the binary case.

For the general case, for any two distributions $P_1$ and $P_2$, let

$$A = \{x : P_1(x) > P_2(x)\}. \tag{11.144}$$

Define a new binary random variable $Y = \phi(X)$, the indicator of the set $A$, and let $\hat{P}_1$ and $\hat{P}_2$ be the distributions of $Y$. Thus, $\hat{P}_1$ and $\hat{P}_2$ correspond to the quantized versions of $P_1$ and $P_2$. Then by the data-processing inequality applied to relative entropies (which is proved in the same way as the data-processing inequality for mutual information), we have

$$\begin{aligned}
D(P_1\|P_2) &\ge D(\hat{P}_1\|\hat{P}_2) && \text{(11.145)}\\
&\ge \frac{4}{2 \ln 2} (P_1(A) - P_2(A))^2 && \text{(11.146)}\\
&= \frac{1}{2 \ln 2} \|P_1 - P_2\|_1^2, && \text{(11.147)}
\end{aligned}$$

by (11.137), and the lemma is proved. □
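A quick numerical check of the lemma on randomly generated pairs of distributions is sketched below. The sampling scheme (normalized exponential variates with a fixed seed) and the function names are assumptions made for illustration.

```python
import random
from math import log, log2

def kl(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def random_pmf(k, rng):
    """A random probability mass function on k symbols."""
    w = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(w)
    return [v / total for v in w]

def smallest_slack(trials=5000, k=5, seed=0):
    """Minimum over random pairs of D(P1||P2) - ||P1 - P2||_1^2 / (2 ln 2)."""
    rng = random.Random(seed)
    slack = []
    for _ in range(trials):
        p1, p2 = random_pmf(k, rng), random_pmf(k, rng)
        slack.append(kl(p1, p2) - l1_distance(p1, p2) ** 2 / (2 * log(2)))
    return min(slack)

worst = smallest_slack()
```

The factor $1/(2 \ln 2)$ appears because the relative entropy is measured in bits; with natural logarithms the bound reads $D \ge \frac{1}{2}\|P_1 - P_2\|_1^2$.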


We can now begin the proof of the conditional limit theorem. We first 
outline the method used. As stated at the beginning of the chapter, the 
essential idea is that the probability of a type under Q depends exponen¬ 
tially on the distance of the type from Q, and hence types that are farther 
away are exponentially less likely to occur. We divide the set of types in 
$E$ into two categories: those at about the same distance from $Q$ as $P^*$ and those a distance $2\delta$ farther away. The second set has exponentially less
probability than the first, and hence the first set has a conditional proba¬ 
bility tending to 1. We then use the Pythagorean theorem to establish that 
all the elements in the first set are close to P*, which will establish the 
theorem. 

The following theorem is an important strengthening of the maximum 
entropy principle. 

Theorem 11.6.2 (Conditional limit theorem) Let $E$ be a closed convex subset of $\mathcal{P}$ and let $Q$ be a distribution not in $E$. Let $X_1, X_2, \ldots, X_n$ be discrete random variables drawn i.i.d. $\sim Q$. Let $P^*$ achieve $\min_{P \in E} D(P\|Q)$. Then

$$\Pr(X_1 = a \mid P_{X^n} \in E) \to P^*(a) \tag{11.148}$$

in probability as $n \to \infty$, i.e., the conditional distribution of $X_1$, given that the type of the sequence is in $E$, is close to $P^*$ for large $n$.

Example 11.6.1 If $X_i$ i.i.d. $\sim Q$, then

$$\Pr\left\{ X_1 = a \,\Big|\, \frac{1}{n} \sum X_i^2 \ge \alpha \right\} \to P^*(a), \tag{11.149}$$

where $P^*(a)$ minimizes $D(P\|Q)$ over $P$ satisfying $\sum_a P(a) a^2 \ge \alpha$. This minimization results in

$$P^*(a) = Q(a) \frac{e^{\lambda a^2}}{\sum_{a} Q(a) e^{\lambda a^2}}, \tag{11.150}$$

where $\lambda$ is chosen to satisfy $\sum_a P^*(a) a^2 = \alpha$. Thus, the conditional distribution on $X_1$ given a constraint on the sum of the squares is a (normalized) product of the original probability mass function and the maximum entropy probability mass function (which in this case is Gaussian).


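Proof of Theorem: Define the sets

$$S_t = \{P \in \mathcal{P} : D(P\|Q) \le t\}. \tag{11.151}$$

The set $S_t$ is convex since $D(P\|Q)$ is a convex function of $P$. Let

$$D^* = D(P^*\|Q) = \min_{P \in E} D(P\|Q). \tag{11.152}$$

Then $P^*$ is unique, since $D(P\|Q)$ is strictly convex in $P$. Now define the set

$$A = S_{D^* + 2\delta} \cap E \tag{11.153}$$

and

$$B = E - S_{D^* + 2\delta} \cap E. \tag{11.154}$$

Thus, $A \cup B = E$. These sets are illustrated in Figure 11.7. Then

$$\begin{aligned}
Q^n(B) &= \sum_{P \in E \cap \mathcal{P}_n : D(P\|Q) > D^* + 2\delta} Q^n(T(P)) && \text{(11.155)}\\
&\le \sum_{P \in E \cap \mathcal{P}_n : D(P\|Q) > D^* + 2\delta} 2^{-nD(P\|Q)} && \text{(11.156)}\\
&\le \sum_{P \in E \cap \mathcal{P}_n : D(P\|Q) > D^* + 2\delta} 2^{-n(D^* + 2\delta)} && \text{(11.157)}\\
&\le (n+1)^{|\mathcal{X}|}\, 2^{-n(D^* + 2\delta)} && \text{(11.158)}
\end{aligned}$$

since there are only a polynomial number of types. On the other hand,

$$\begin{aligned}
Q^n(A) &\ge Q^n(S_{D^* + \delta} \cap E) && \text{(11.159)}\\
&= \sum_{P \in E \cap \mathcal{P}_n : D(P\|Q) \le D^* + \delta} Q^n(T(P)) && \text{(11.160)}\\
&\ge \sum_{P \in E \cap \mathcal{P}_n : D(P\|Q) \le D^* + \delta} \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-nD(P\|Q)} && \text{(11.161)}\\
&\ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n(D^* + \delta)} \quad \text{for } n \text{ sufficiently large,} && \text{(11.162)}
\end{aligned}$$

since the sum is greater than one of the terms, and for sufficiently large $n$, there exists at least one type in $S_{D^* + \delta} \cap E \cap \mathcal{P}_n$. Then, for $n$ sufficiently large,

$$\begin{aligned}
\Pr(P_{X^n} \in B \mid P_{X^n} \in E) &= \frac{Q^n(B \cap E)}{Q^n(E)} && \text{(11.163)}\\
&\le \frac{Q^n(B)}{Q^n(A)} && \text{(11.164)}\\
&\le \frac{(n+1)^{|\mathcal{X}|}\, 2^{-n(D^* + 2\delta)}}{\frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{-n(D^* + \delta)}} && \text{(11.165)}\\
&= (n+1)^{2|\mathcal{X}|}\, 2^{-n\delta}, && \text{(11.166)}
\end{aligned}$$

which goes to 0 as $n \to \infty$. Hence the conditional probability of $B$ goes to 0 as $n \to \infty$, which implies that the conditional probability of $A$ goes to 1.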

We now show that all the members of $A$ are close to $P^*$ in relative entropy. For all members of $A$,

$$D(P\|Q) \le D^* + 2\delta. \tag{11.167}$$

Hence by the "Pythagorean" theorem (Theorem 11.6.1),

$$D(P\|P^*) + D(P^*\|Q) \le D(P\|Q) \le D^* + 2\delta, \tag{11.168}$$

which in turn implies that

$$D(P\|P^*) \le 2\delta, \tag{11.169}$$

since $D(P^*\|Q) = D^*$. Thus, $P_x \in A$ implies that $D(P_x\|Q) \le D^* + 2\delta$, and therefore that $D(P_x\|P^*) \le 2\delta$. Consequently, since $\Pr\{P_{X^n} \in A \mid P_{X^n} \in E\} \to 1$, it follows that

$$\Pr(D(P_{X^n}\|P^*) \le 2\delta \mid P_{X^n} \in E) \to 1 \tag{11.170}$$

as $n \to \infty$. By Lemma 11.6.1, the fact that the relative entropy is small implies that the $\mathcal{L}_1$ distance is small, which in turn implies that $\max_{a \in \mathcal{X}} |P_{X^n}(a) - P^*(a)|$ is small. Thus, $\Pr(|P_{X^n}(a) - P^*(a)| \ge \epsilon \mid P_{X^n} \in E) \to 0$ as $n \to \infty$. Alternatively, this can be written as

$$\Pr(X_1 = a \mid P_{X^n} \in E) \to P^*(a) \quad \text{in probability},\ a \in \mathcal{X}. \tag{11.171}$$

In this theorem we have only proved that the marginal distribution goes to $P^*$ as $n \to \infty$. Using a similar argument, we can prove a stronger version of this theorem:

$$\Pr(X_1 = a_1, X_2 = a_2, \ldots, X_m = a_m \mid P_{X^n} \in E) \to \prod_{i=1}^{m} P^*(a_i) \quad \text{in probability}. \tag{11.172}$$

This holds for fixed $m$ as $n \to \infty$. The result is not true for $m = n$, since there are end effects; given that the type of the sequence is in $E$, the last elements of the sequence can be determined from the remaining elements, and the elements are no longer independent. The conditional limit theorem states that the first few elements are asymptotically independent with common distribution $P^*$.


Example 11.6.2 As an example of the conditional limit theorem, let us consider the case when $n$ fair dice are rolled. Suppose that the sum of the outcomes exceeds $4n$. Then by the conditional limit theorem, the probability that the first die shows a number $a \in \{1, 2, \ldots, 6\}$ is approximately $P^*(a)$, where $P^*(a)$ is the distribution in $E$ that is closest to the uniform distribution, where $E = \{P : \sum_a P(a) a \ge 4\}$. This is the maximum entropy distribution given by

$$P^*(x) = \frac{2^{\lambda x}}{\sum_{i=1}^{6} 2^{\lambda i}}, \tag{11.173}$$

with $\lambda$ chosen so that $\sum_i i P^*(i) = 4$ (see Chapter 12). Here $P^*$ is the conditional distribution on the first (or any other) die. Apparently, the first few dice inspected will behave as if they are drawn independently according to an exponential distribution.
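For a moderate number of dice the conditional distribution of the first die can be computed exactly and compared with $P^*$. The sketch below is illustrative only: it conditions on the sum being at least $4n$ (matching the set $E$), counts outcomes of the remaining $n - 1$ dice by dynamic programming, and takes the values of $P^*$ from Example 11.5.1. The function names are assumptions.

```python
# P* as reported in Example 11.5.1 (rounded to four decimals)
P_STAR = [0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468]

def sum_counts(m):
    """counts[s] = number of outcomes of m dice whose total is s."""
    counts = {0: 1}
    for _ in range(m):
        nxt = {}
        for s, c in counts.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0) + c
        counts = nxt
    return counts

def conditional_first_die(n):
    """Pr(X_1 = a | X_1 + ... + X_n >= 4n) for fair dice, a = 1..6."""
    rest = sum_counts(n - 1)
    # Number of ways the other n-1 dice can complete a total of at least 4n.
    ways = [sum(c for s, c in rest.items() if a + s >= 4 * n)
            for a in range(1, 7)]
    total = sum(ways)
    return [w / total for w in ways]

cond = conditional_first_die(200)
max_gap = max(abs(c - p) for c, p in zip(cond, P_STAR))
```

Exact integer counts are used so that the very small tail probabilities do not underflow; only the final ratios are converted to floating point.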


11.7 HYPOTHESIS TESTING 

One of the standard problems in statistics is to decide between two alter¬ 
native explanations for the data observed. For example, in medical testing, 
one may wish to test whether or not a new drug is effective. Similarly, a 
sequence of coin tosses may reveal whether or not the coin is biased. 

These problems are examples of the general hypothesis-testing problem. 
In the simplest case, we have to decide between two i.i.d. distributions. 
The general problem can be stated as follows: 

Problem 11.7.1 Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q(x)$. We consider two hypotheses:

• $H_1$: $Q = P_1$.

• $H_2$: $Q = P_2$.

Consider the general decision function $g(x_1, x_2, \ldots, x_n)$, where $g(x_1, x_2, \ldots, x_n) = 1$ means that $H_1$ is accepted and $g(x_1, x_2, \ldots, x_n) = 2$ means that $H_2$ is accepted. Since the function takes on only two values, the test can also be specified by specifying the set $A$ over which $g(x_1, x_2, \ldots, x_n)$ is 1; the complement of this set is the set where $g(x_1, x_2, \ldots, x_n)$ has the value 2. We define the two probabilities of error:

$$\alpha = \Pr(g(X_1, X_2, \ldots, X_n) = 2 \mid H_1 \text{ true}) = P_1^n(A^c) \tag{11.174}$$

and

$$\beta = \Pr(g(X_1, X_2, \ldots, X_n) = 1 \mid H_2 \text{ true}) = P_2^n(A). \tag{11.175}$$

In general, we wish to minimize both probabilities, but there is a trade¬ 
off. Thus, we minimize one of the probabilities of error subject to a 
constraint on the other probability of error. The best achievable error 
exponent in the probability of error for this problem is given by the 
Chernoff-Stein lemma. 

We first prove the Neyman-Pearson lemma, which derives the form of 
the optimum test between two hypotheses. We derive the result for discrete 
distributions; the same results can be derived for continuous distributions 
as well. 

Theorem 11.7.1 (Neyman-Pearson lemma) Let $X_1, X_2, \ldots, X_n$ be
drawn i.i.d. according to probability mass function $Q$. Consider the decision
problem corresponding to hypotheses $Q = P_1$ vs. $Q = P_2$. For $T \geq 0$,
define a region

$$A_n(T) = \left\{ x^n : \frac{P_1(x_1, x_2, \ldots, x_n)}{P_2(x_1, x_2, \ldots, x_n)} > T \right\}. \tag{11.176}$$

Let

$$\alpha^* = P_1^n(A_n^c(T)), \qquad \beta^* = P_2^n(A_n(T)) \tag{11.177}$$

be the probabilities of error corresponding to decision region $A_n$. Let
$B_n$ be any other decision region with associated probabilities of error
$\alpha$ and $\beta$. If $\alpha \leq \alpha^*$, then $\beta \geq \beta^*$.

Proof: Let $A = A_n(T)$ be the region defined in (11.176) and let
$B \subseteq \mathcal{X}^n$ be any other acceptance region. Let $\phi_A$ and $\phi_B$ be the
indicator functions of the decision regions $A$ and $B$, respectively. Then
for all $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$,

$$(\phi_A(\mathbf{x}) - \phi_B(\mathbf{x}))(P_1(\mathbf{x}) - T P_2(\mathbf{x})) \geq 0. \tag{11.178}$$

This can be seen by considering separately the cases $\mathbf{x} \in A$ and
$\mathbf{x} \notin A$. Multiplying out and summing this over the entire space, we obtain

$$
\begin{aligned}
0 &\leq \sum \left( \phi_A P_1 - T \phi_A P_2 - P_1 \phi_B + T P_2 \phi_B \right) && (11.179) \\
  &= \sum_A (P_1 - T P_2) - \sum_B (P_1 - T P_2) && (11.180) \\
  &= (1 - \alpha^*) - T\beta^* - (1 - \alpha) + T\beta && (11.181) \\
  &= T(\beta - \beta^*) - (\alpha^* - \alpha). && (11.182)
\end{aligned}
$$

If $\alpha \leq \alpha^*$, this gives $T(\beta - \beta^*) \geq \alpha^* - \alpha \geq 0$.
Since $T \geq 0$, we have proved the theorem. □


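The Neyman-Pearson lemma indicates that the optimum test for two
hypotheses is of the form

$$\frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)} > T. \tag{11.183}$$

This is the likelihood ratio test and the quantity
$\frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)}$ is called the
likelihood ratio. For example, in a test between two Gaussian distributions
[i.e., between $f_1 = \mathcal{N}(1, \sigma^2)$ and $f_2 = \mathcal{N}(-1, \sigma^2)$], the likelihood ratio
becomes

$$
\begin{aligned}
\frac{f_1(X_1, X_2, \ldots, X_n)}{f_2(X_1, X_2, \ldots, X_n)}
  &= \frac{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X_i - 1)^2}{2\sigma^2}}}{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(X_i + 1)^2}{2\sigma^2}}} && (11.184) \\
  &= e^{+\frac{2 \sum_{i=1}^n X_i}{\sigma^2}} && (11.185) \\
  &= e^{+\frac{2 n \bar{X}_n}{\sigma^2}}. && (11.186)
\end{aligned}
$$

Hence, the likelihood ratio test consists of comparing the sample mean
$\bar{X}_n$ with a threshold. If we want the two probabilities of error to be equal,
we should set $T = 1$. This is illustrated in Figure 11.8.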
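The reduction of the Gaussian likelihood ratio test to a threshold on the sample mean can be checked numerically. The following is a minimal sketch, not taken from the text: the function names, the sample values and the choice $\sigma^2 = 2$ are illustrative assumptions. It sums the per-sample log ratios directly and compares the result with the closed form $2n\bar{X}_n/\sigma^2$ from (11.186).

```python
import math

def log_likelihood_ratio(samples, sigma2):
    """Natural log of f1(x^n)/f2(x^n) for f1 = N(1, sigma2), f2 = N(-1, sigma2)."""
    total = 0.0
    for x in samples:
        # Difference of the two Gaussian exponents; the normalizing constants cancel.
        total += (-(x - 1.0) ** 2 + (x + 1.0) ** 2) / (2.0 * sigma2)
    return total

def accept_h1(samples, sigma2, threshold=1.0):
    """Likelihood ratio test: accept f1 when f1/f2 > threshold (T in the text)."""
    return log_likelihood_ratio(samples, sigma2) > math.log(threshold)

samples = [0.4, -0.1, 1.3, 0.9]   # hypothetical observations
sigma2 = 2.0                      # hypothetical common variance
n = len(samples)
sample_mean = sum(samples) / n

llr = log_likelihood_ratio(samples, sigma2)
closed_form = 2.0 * n * sample_mean / sigma2   # exponent in (11.186)
decision = accept_h1(samples, sigma2)          # with T = 1: "sample mean > 0"
```

With $T = 1$ the decision depends only on the sign of the sample mean, which is why the two error probabilities are equal in this symmetric setup.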

FIGURE 11.8. Testing between two Gaussian distributions.

In Theorem 11.7.1 we have shown that the optimum test is a likelihood
ratio test. We can rewrite the log-likelihood ratio as

$$
\begin{aligned}
L(X_1, X_2, \ldots, X_n)
  &= \log \frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)} && (11.187) \\
  &= \sum_{i=1}^n \log \frac{P_1(X_i)}{P_2(X_i)} && (11.188) \\
  &= \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_1(a)}{P_2(a)} && (11.189) \\
  &= \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_1(a)}{P_2(a)} \frac{P_{X^n}(a)}{P_{X^n}(a)} && (11.190) \\
  &= \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_{X^n}(a)}{P_2(a)}
     - \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_{X^n}(a)}{P_1(a)} && (11.191) \\
  &= n D(P_{X^n} \| P_2) - n D(P_{X^n} \| P_1), && (11.192)
\end{aligned}
$$

the difference between the relative entropy distances of the sample type
to each of the two distributions. Hence, the likelihood ratio test

$$\frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)} > T \tag{11.193}$$

is equivalent to

$$D(P_{X^n} \| P_2) - D(P_{X^n} \| P_1) > \frac{1}{n} \log T. \tag{11.194}$$
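The identity (11.187)-(11.192) says that the log-likelihood ratio depends on the sequence only through its type. A small sketch of that check follows; the alphabet, the two distributions and the sequence are hypothetical, and logarithms are taken to base 2.

```python
import math
from collections import Counter

def kl(p, q):
    """Relative entropy D(p||q) in bits; p and q are dicts on a common alphabet."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

def log_lr(seq, p1, p2):
    """log2 of P1(x^n)/P2(x^n) for an i.i.d. sequence, as in (11.188)."""
    return sum(math.log2(p1[x] / p2[x]) for x in seq)

def type_of(seq):
    """Empirical distribution (type) of the sequence."""
    n = len(seq)
    return {a: c / n for a, c in Counter(seq).items()}

p1 = {"a": 0.5, "b": 0.3, "c": 0.2}   # hypothetical P_1
p2 = {"a": 0.2, "b": 0.3, "c": 0.5}   # hypothetical P_2
seq = list("aabacbca")                # hypothetical observed sequence
n = len(seq)
t = type_of(seq)

lhs = log_lr(seq, p1, p2)                  # left side, (11.187)
rhs = n * kl(t, p2) - n * kl(t, p1)        # right side, (11.192)
```

Any permutation of `seq` has the same type and therefore the same value of `lhs`, which is the point of working on the simplex of types.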


We can consider the test to be equivalent to specifying a region of the simplex
of types that corresponds to choosing hypothesis $H_1$. The optimum
region is of the form (11.194), for which the boundary of the region is the 
set of types for which the difference between the distances is a constant. 
This boundary is the analog of the perpendicular bisector in Euclidean 
geometry. The test is illustrated in Figure 11.9. 

FIGURE 11.9. Likelihood ratio test on the probability simplex.

We now offer some informal arguments based on Sanov’s theorem to
show how to choose the threshold to obtain different probabilities of error.
Let $B$ denote the set on which hypothesis 1 is accepted. The probability
of error of the first kind is

$$\alpha_n = P_1^n(P_{X^n} \in B^c). \tag{11.195}$$

Since the set $B^c$ is convex, we can use Sanov’s theorem to show that the
probability of error is determined essentially by the relative entropy of
the closest member of $B^c$ to $P_1$. Therefore,

$$\alpha_n \doteq 2^{-n D(P_1^* \| P_1)}, \tag{11.196}$$

where $P_1^*$ is the closest element of $B^c$ to distribution $P_1$. Similarly,

$$\beta_n \doteq 2^{-n D(P_2^* \| P_2)}, \tag{11.197}$$

where $P_2^*$ is the closest element in $B$ to the distribution $P_2$.

Now minimizing $D(P \| P_2)$ subject to the constraint
$D(P \| P_2) - D(P \| P_1) \geq \frac{1}{n} \log T$ will yield the type in $B$ that is
closest to $P_2$. Setting up the minimization of $D(P \| P_2)$ subject to
$D(P \| P_2) - D(P \| P_1) = \frac{1}{n} \log T$ using Lagrange multipliers, we have

$$J(P) = \sum P(x) \log \frac{P(x)}{P_2(x)} + \lambda \sum P(x) \log \frac{P_1(x)}{P_2(x)} + \nu \sum P(x). \tag{11.198}$$

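Differentiating with respect to $P(x)$ and setting to 0, we have

$$\log \frac{P(x)}{P_2(x)} + 1 + \lambda \log \frac{P_1(x)}{P_2(x)} + \nu = 0. \tag{11.199}$$

Solving this set of equations, we obtain the minimizing $P$ of the form

$$P_2^* = P_{\lambda^*} = \frac{P_1^{\lambda}(x) P_2^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_1^{\lambda}(a) P_2^{1-\lambda}(a)}, \tag{11.200}$$

where $\lambda$ is chosen so that the constraint is met, that is,
$D(P_{\lambda^*} \| P_2) - D(P_{\lambda^*} \| P_1) = \frac{1}{n} \log T$.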

From the symmetry of expression (11.200), it is clear that $P_1^* = P_2^*$ and
that the probabilities of error behave exponentially with exponents given
by the relative entropies $D(P^* \| P_1)$ and $D(P^* \| P_2)$. Also note from the
equation that as $\lambda \to 1$, $P_\lambda \to P_1$ and as $\lambda \to 0$, $P_\lambda \to P_2$. The curve
that $P_\lambda$ traces out as $\lambda$ varies is a geodesic in the simplex. Here $P_\lambda$ is a
normalized convex combination, where the combination is in the exponent
(Figure 11.9).
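The family $P_\lambda$ is easy to tabulate for a small alphabet. The sketch below uses two hypothetical distributions on three symbols and steps $\lambda$ from 0 to 1; it is meant only to illustrate the endpoint behaviour described above and how the two relative entropies move along the curve.

```python
import math

def tilted(p1, p2, lam):
    """P_lambda(x) proportional to p1(x)^lam * p2(x)^(1 - lam), as in (11.200)."""
    weights = [a ** lam * b ** (1.0 - lam) for a, b in zip(p1, p2)]
    z = sum(weights)                     # normalizing constant
    return [w / z for w in weights]

def kl(p, q):
    """Relative entropy D(p||q) in bits for two lists on the same alphabet."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p1 = [0.5, 0.3, 0.2]   # hypothetical P_1
p2 = [0.2, 0.3, 0.5]   # hypothetical P_2

# Walk along the curve from P_2 (lambda = 0) to P_1 (lambda = 1).
path = [(step / 10, tilted(p1, p2, step / 10)) for step in range(11)]
d_to_p1 = [kl(p, p1) for _, p in path]   # shrinks to 0 as lambda -> 1
d_to_p2 = [kl(p, p2) for _, p in path]   # grows from 0 as lambda -> 0 is left behind
```

In this example $D(P_\lambda \| P_1)$ falls and $D(P_\lambda \| P_2)$ rises as $\lambda$ goes from 0 to 1, so the two curves cross exactly once.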

In the next section we calculate the best error exponent when one of
the two types of error goes to zero arbitrarily slowly (the Chernoff-Stein
lemma). We will also minimize the weighted sum of the two probabilities
of error and obtain the Chernoff information bound.


11.8 CHERNOFF-STEIN LEMMA 

We consider hypothesis testing in the case when one of the probabilities
of error is held fixed and the other is made as small as possible.
We will show that the other probability of error is exponentially small, 
with an exponential rate equal to the relative entropy between the two 
distributions. The method of proof uses a relative entropy version of the 
AEP. 

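Theorem 11.8.1 (AEP for relative entropy) Let $X_1, X_2, \ldots, X_n$ be
a sequence of random variables drawn i.i.d. according to $P_1(x)$, and let
$P_2(x)$ be any other distribution on $\mathcal{X}$. Then

$$\frac{1}{n} \log \frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)} \to D(P_1 \| P_2) \quad \text{in probability}. \tag{11.201}$$

Proof: This follows directly from the weak law of large numbers.

$$
\begin{aligned}
\frac{1}{n} \log \frac{P_1(X_1, X_2, \ldots, X_n)}{P_2(X_1, X_2, \ldots, X_n)}
  &= \frac{1}{n} \log \frac{\prod_{i=1}^n P_1(X_i)}{\prod_{i=1}^n P_2(X_i)} && (11.202) \\
  &= \frac{1}{n} \sum_{i=1}^n \log \frac{P_1(X_i)}{P_2(X_i)} && (11.203) \\
  &\to E_{P_1} \log \frac{P_1(X)}{P_2(X)} \quad \text{in probability} && (11.204) \\
  &= D(P_1 \| P_2). \quad \Box && (11.205)
\end{aligned}
$$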
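For Bernoulli sources the convergence in (11.201) can be computed exactly rather than simulated, because the log ratio depends only on the number of ones. The following sketch assumes $P_1 = \text{Bernoulli}(0.5)$, $P_2 = \text{Bernoulli}(0.2)$ and $\epsilon = 0.1$ (all hypothetical choices) and evaluates the $P_1$-probability that the normalized log ratio misses $D(P_1 \| P_2)$ by more than $\epsilon$.

```python
import math

def log_binom_pmf(n, k, p):
    """Natural log of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def deviation_probability(n, p, q, eps):
    """P1-probability that (1/n) log2 P1(X^n)/P2(X^n) is more than eps from D(P1||P2)."""
    d = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
    total = 0.0
    for k in range(n + 1):               # k = number of ones in the sequence
        ratio = (k * math.log2(p / q)
                 + (n - k) * math.log2((1 - p) / (1 - q))) / n
        if abs(ratio - d) > eps:
            total += math.exp(log_binom_pmf(n, k, p))
    return total

# Probability of an atypical sequence for increasing block lengths.
probs = {n: deviation_probability(n, 0.5, 0.2, 0.1) for n in (50, 200, 1000)}
```

The returned values shrink as $n$ grows, which is the statement of the theorem for this particular pair of distributions.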


Just as for the regular AEP, we can define a relative entropy typical 
sequence as one for which the empirical relative entropy is close to its 
expected value. 

Definition For a fixed $n$ and $\epsilon > 0$, a sequence $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$
is said to be relative entropy typical if and only if

$$D(P_1 \| P_2) - \epsilon \leq \frac{1}{n} \log \frac{P_1(x_1, x_2, \ldots, x_n)}{P_2(x_1, x_2, \ldots, x_n)} \leq D(P_1 \| P_2) + \epsilon. \tag{11.206}$$

The set of relative entropy typical sequences is called the relative entropy
typical set $A_\epsilon^{(n)}(P_1 \| P_2)$.

As a consequence of the relative entropy AEP, we can show that the 
relative entropy typical set satisfies the following properties: 

Theorem 11.8.2

1. For $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}(P_1 \| P_2)$,

$$P_1(x_1, x_2, \ldots, x_n) 2^{-n(D(P_1 \| P_2) + \epsilon)} \leq P_2(x_1, x_2, \ldots, x_n) \leq P_1(x_1, x_2, \ldots, x_n) 2^{-n(D(P_1 \| P_2) - \epsilon)}. \tag{11.207}$$

2. $P_1(A_\epsilon^{(n)}(P_1 \| P_2)) > 1 - \epsilon$, for $n$ sufficiently large.

3. $P_2(A_\epsilon^{(n)}(P_1 \| P_2)) < 2^{-n(D(P_1 \| P_2) - \epsilon)}$.

4. $P_2(A_\epsilon^{(n)}(P_1 \| P_2)) > (1 - \epsilon) 2^{-n(D(P_1 \| P_2) + \epsilon)}$, for $n$ sufficiently large.


Proof: The proof follows the same lines as the proof of Theorem 3.1.2,
with the counting measure replaced by probability measure $P_2$. The proof
of property 1 follows directly from the definition of the relative entropy
typical set. The second property follows from the AEP for relative entropy
(Theorem 11.8.1). To prove the third property, we write

$$
\begin{aligned}
P_2(A_\epsilon^{(n)}(P_1 \| P_2))
  &= \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_2(x_1, x_2, \ldots, x_n) && (11.208) \\
  &\leq \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_1(x_1, x_2, \ldots, x_n) 2^{-n(D(P_1 \| P_2) - \epsilon)} && (11.209) \\
  &= 2^{-n(D(P_1 \| P_2) - \epsilon)} \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_1(x_1, x_2, \ldots, x_n) && (11.210) \\
  &= 2^{-n(D(P_1 \| P_2) - \epsilon)} P_1(A_\epsilon^{(n)}(P_1 \| P_2)) && (11.211) \\
  &\leq 2^{-n(D(P_1 \| P_2) - \epsilon)}, && (11.212)
\end{aligned}
$$

where the first inequality follows from property 1, and the second inequality
follows from the fact that the probability of any set under $P_1$ is less
than 1.

To prove the lower bound on the probability of the relative entropy
typical set, we use a parallel argument with a lower bound on the probability:

$$
\begin{aligned}
P_2(A_\epsilon^{(n)}(P_1 \| P_2))
  &= \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_2(x_1, x_2, \ldots, x_n) && (11.213) \\
  &\geq \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_1(x_1, x_2, \ldots, x_n) 2^{-n(D(P_1 \| P_2) + \epsilon)} && (11.214) \\
  &= 2^{-n(D(P_1 \| P_2) + \epsilon)} \sum_{x^n \in A_\epsilon^{(n)}(P_1 \| P_2)} P_1(x_1, x_2, \ldots, x_n) && (11.215) \\
  &= 2^{-n(D(P_1 \| P_2) + \epsilon)} P_1(A_\epsilon^{(n)}(P_1 \| P_2)) && (11.216) \\
  &\geq (1 - \epsilon) 2^{-n(D(P_1 \| P_2) + \epsilon)}, && (11.217)
\end{aligned}
$$

where the second inequality follows from the second property of
$A_\epsilon^{(n)}(P_1 \| P_2)$. □

With the standard AEP in Chapter 3, we also showed that any set that
has a high probability has a high intersection with the typical set, and
therefore has about $2^{nH}$ elements. We now prove the corresponding result
for relative entropy.

Lemma 11.8.1 Let $B_n \subset \mathcal{X}^n$ be any set of sequences $x_1, x_2, \ldots, x_n$ such
that $P_1(B_n) > 1 - \epsilon$. Let $P_2$ be any other distribution such that
$D(P_1 \| P_2) < \infty$. Then $P_2(B_n) > (1 - 2\epsilon) 2^{-n(D(P_1 \| P_2) + \epsilon)}$.

Proof: For simplicity, we will denote $A_\epsilon^{(n)}(P_1 \| P_2)$ by $A_n$. Since
$P_1(B_n) > 1 - \epsilon$ and $P_1(A_n) > 1 - \epsilon$ (Theorem 11.8.2), we have, by the
union of events bound, $P_1(A_n^c \cup B_n^c) < 2\epsilon$, or equivalently,
$P_1(A_n \cap B_n) > 1 - 2\epsilon$. Thus,

$$
\begin{aligned}
P_2(B_n) &\geq P_2(A_n \cap B_n) && (11.218) \\
  &= \sum_{x^n \in A_n \cap B_n} P_2(x^n) && (11.219) \\
  &\geq \sum_{x^n \in A_n \cap B_n} P_1(x^n) 2^{-n(D(P_1 \| P_2) + \epsilon)} && (11.220) \\
  &= 2^{-n(D(P_1 \| P_2) + \epsilon)} \sum_{x^n \in A_n \cap B_n} P_1(x^n) && (11.221) \\
  &= 2^{-n(D(P_1 \| P_2) + \epsilon)} P_1(A_n \cap B_n) && (11.222) \\
  &\geq 2^{-n(D(P_1 \| P_2) + \epsilon)} (1 - 2\epsilon), && (11.223)
\end{aligned}
$$

where the second inequality follows from the properties of the relative
entropy typical sequences (Theorem 11.8.2) and the last inequality follows
from the union bound above. □

We now consider the problem of testing two hypotheses, $P_1$ vs. $P_2$. We
hold one of the probabilities of error fixed and attempt to minimize the
other probability of error. We show that the relative entropy is the best
exponent in probability of error.

Theorem 11.8.3 (Chernoff-Stein Lemma) Let $X_1, X_2, \ldots, X_n$ be
i.i.d. $\sim Q$. Consider the hypothesis test between two alternatives, $Q = P_1$
and $Q = P_2$, where $D(P_1 \| P_2) < \infty$. Let $A_n \subseteq \mathcal{X}^n$ be an acceptance
region for hypothesis $H_1$. Let the probabilities of error be

$$\alpha_n = P_1^n(A_n^c), \qquad \beta_n = P_2^n(A_n), \tag{11.224}$$

and for $0 < \epsilon < \frac{1}{2}$, define

$$\beta_n^\epsilon = \min_{\substack{A_n \subseteq \mathcal{X}^n \\ \alpha_n < \epsilon}} \beta_n. \tag{11.225}$$

Then

$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n^\epsilon = -D(P_1 \| P_2). \tag{11.226}$$

Proof: We prove this theorem in two parts. In the first part we exhibit
a sequence of sets $A_n$ for which the probability of error $\beta_n$ goes
exponentially to zero as $D(P_1 \| P_2)$. In the second part we show that no other
sequence of sets can have a lower exponent in the probability of error.

For the first part, we choose as the sets $A_n = A_\epsilon^{(n)}(P_1 \| P_2)$. As proved in
Theorem 11.8.2, this sequence of sets has $P_1(A_n^c) < \epsilon$ for $n$ large enough.
Also,

$$\lim_{n \to \infty} \frac{1}{n} \log P_2(A_n) \leq -(D(P_1 \| P_2) - \epsilon) \tag{11.227}$$

from property 3 of Theorem 11.8.2. Thus, the relative entropy typical set
satisfies the bounds of the lemma.

To show that no other sequence of sets can do better, consider any
sequence of sets $B_n$ with $P_1(B_n) > 1 - \epsilon$. By Lemma 11.8.1, we have
$P_2(B_n) > (1 - 2\epsilon) 2^{-n(D(P_1 \| P_2) + \epsilon)}$, and therefore

$$\lim_{n \to \infty} \frac{1}{n} \log P_2(B_n) > -(D(P_1 \| P_2) + \epsilon) + \lim_{n \to \infty} \frac{1}{n} \log(1 - 2\epsilon) = -(D(P_1 \| P_2) + \epsilon). \tag{11.228}$$

Thus, no other sequence of sets has a probability of error exponent better
than $D(P_1 \| P_2)$. Thus, the set sequence $A_n = A_\epsilon^{(n)}(P_1 \| P_2)$ is
asymptotically optimal in terms of the exponent in the probability. □

Note that the relative entropy typical set, although asymptotically optimal
(i.e., achieving the best asymptotic rate), is not the optimal set for
any fixed hypothesis-testing problem. The optimal set that minimizes the 
probabilities of error is that given by the Neyman-Pearson lemma. 
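The slow approach of the exponent to $D(P_1 \| P_2)$ can be seen in an exact Bernoulli computation. The sketch below is an illustration under stated assumptions, not the construction used in the proof: it takes $P_1 = \text{Bernoulli}(0.5)$, $P_2 = \text{Bernoulli}(0.2)$ and $\epsilon = 0.1$, uses a deterministic likelihood ratio threshold test on the number of ones $k$, and picks the largest threshold whose type-I error stays at or below $\epsilon$.

```python
import math

def log_binom_pmf(n, k, p):
    """Natural log of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def stein_exponent(n, p, q, eps):
    """-(1/n) log2(beta_n) for the test "accept H1 if k >= k0" with alpha_n <= eps.

    Assumes p > q, so the likelihood ratio is increasing in k.
    """
    # Largest k0 whose type-I error P1(K < k0) stays at or below eps.
    alpha, k0 = 0.0, 0
    while k0 <= n:
        step = math.exp(log_binom_pmf(n, k0, p))
        if alpha + step > eps:
            break
        alpha += step
        k0 += 1
    # beta_n = P2(K >= k0), summed in log space to avoid underflow.
    logs = [log_binom_pmf(n, k, q) for k in range(k0, n + 1)]
    top = max(logs)
    log_beta = top + math.log(sum(math.exp(v - top) for v in logs))
    return -log_beta / (n * math.log(2.0))

p, q, eps = 0.5, 0.2, 0.1
d = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))   # D(P1||P2) in bits
exponents = {n: stein_exponent(n, p, q, eps) for n in (100, 2000)}
```

For these parameters the computed exponents lie below $D(P_1 \| P_2)$ and move toward it as $n$ increases, consistent with the limit in (11.226).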


11.9 CHERNOFF INFORMATION 


We have considered the problem of hypothesis testing in the classical
setting, in which we treat the two probabilities of error separately. In the
derivation of the Chernoff-Stein lemma, we set $\alpha_n \leq \epsilon$ and achieved
$\beta_n \doteq 2^{-nD}$. But this approach lacks symmetry. Instead, we can follow
a Bayesian approach, in which we assign prior probabilities to both
hypotheses. In this case we wish to minimize the overall probability of
error given by the weighted sum of the individual probabilities of error.
The resulting error exponent is the Chernoff information.

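The setup is as follows: $X_1, X_2, \ldots, X_n$ i.i.d. $\sim Q$. We have two
hypotheses: $Q = P_1$ with prior probability $\pi_1$ and $Q = P_2$ with prior
probability $\pi_2$. The overall probability of error is

$$P_e^{(n)} = \pi_1 \alpha_n + \pi_2 \beta_n. \tag{11.229}$$

Let

$$D^* = \lim_{n \to \infty} -\frac{1}{n} \log \min_{A_n \subseteq \mathcal{X}^n} P_e^{(n)}. \tag{11.230}$$

Theorem 11.9.1 (Chernoff) The best achievable exponent in the
Bayesian probability of error is $D^*$, where

$$D^* = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2), \tag{11.231}$$

with

$$P_\lambda = \frac{P_1^{\lambda}(x) P_2^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_1^{\lambda}(a) P_2^{1-\lambda}(a)}, \tag{11.232}$$

and $\lambda^*$ the value of $\lambda$ such that

$$D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2). \tag{11.233}$$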


Proof: The basic details of the proof were given in Section 11.7. We
have shown that the optimum test is a likelihood ratio test, which can be
considered to be of the form

$$D(P_{X^n} \| P_2) - D(P_{X^n} \| P_1) > \frac{1}{n} \log T. \tag{11.234}$$

The test divides the probability simplex into regions corresponding to
hypothesis 1 and hypothesis 2, respectively. This is illustrated in
Figure 11.10.

Let $A$ be the set of types associated with hypothesis 1. From the
discussion preceding (11.200), it follows that the closest point in the set $A^c$
to $P_1$ is on the boundary of $A$ and is of the form given by (11.232). Then
from the discussion in Section 11.7, it is clear that $P_\lambda$ is the distribution
in $A$ that is closest to $P_2$; it is also the distribution in $A^c$ that is closest
to $P_1$. By Sanov’s theorem, we can calculate the associated probabilities
of error,

$$\alpha_n = P_1^n(A^c) \doteq 2^{-n D(P_{\lambda^*} \| P_1)} \tag{11.235}$$

and

$$\beta_n = P_2^n(A) \doteq 2^{-n D(P_{\lambda^*} \| P_2)}. \tag{11.236}$$

In the Bayesian case, the overall probability of error is the weighted sum
of the two probabilities of error,

$$P_e \doteq \pi_1 2^{-n D(P_\lambda \| P_1)} + \pi_2 2^{-n D(P_\lambda \| P_2)} \doteq 2^{-n \min\{D(P_\lambda \| P_1),\, D(P_\lambda \| P_2)\}}, \tag{11.237}$$

since the exponential rate is determined by the worst exponent. Since
$P_\lambda \to P_1$ as $\lambda \to 1$ and $P_\lambda \to P_2$ as $\lambda \to 0$, $D(P_\lambda \| P_1)$ decreases
with $\lambda$ and $D(P_\lambda \| P_2)$ increases with $\lambda$, so the maximum value of the
minimum of $\{D(P_\lambda \| P_1), D(P_\lambda \| P_2)\}$ is attained when they are equal.
This is illustrated in Figure 11.11. Hence, we choose $\lambda$ so that

$$D(P_\lambda \| P_1) = D(P_\lambda \| P_2). \tag{11.238}$$

Thus, $C(P_1, P_2)$ is the highest achievable exponent for the probability of
error and is called the Chernoff information. □

FIGURE 11.11. Relative entropy $D(P_\lambda \| P_1)$ and $D(P_\lambda \| P_2)$ as a function of $\lambda$.

The definition $D^* = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2)$ is equivalent to the
standard definition of Chernoff information,

$$C(P_1, P_2) \triangleq -\min_{0 \leq \lambda \leq 1} \log \left( \sum_x P_1^{\lambda}(x) P_2^{1-\lambda}(x) \right). \tag{11.239}$$

It is left as an exercise to the reader to show the equivalence of (11.231)
and (11.239).
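The equivalence of (11.231) and (11.239) can at least be checked numerically on a small alphabet. The sketch below is an illustration only: the distributions are hypothetical, the function names are not from the text, and two simple one-dimensional searches (bisection and ternary search) stand in for whatever method one would use in practice. Logarithms are base 2.

```python
import math

def tilted(p1, p2, lam):
    """P_lambda of (11.232) for two lists on the same alphabet."""
    weights = [a ** lam * b ** (1.0 - lam) for a, b in zip(p1, p2)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def chernoff_by_matching(p1, p2, iters=100):
    """Bisect on lambda until D(P_lam||P1) = D(P_lam||P2), as in (11.233)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p = tilted(p1, p2, mid)
        # The gap D(P_lam||P1) - D(P_lam||P2) is positive at 0 and negative at 1.
        if kl(p, p1) > kl(p, p2):
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, kl(tilted(p1, p2, lam), p1)

def chernoff_by_minimizing(p1, p2, iters=200):
    """Ternary search for -min over lambda of log2 sum_x P1^lam P2^(1-lam), as in (11.239)."""
    def f(lam):
        return math.log2(sum(a ** lam * b ** (1.0 - lam) for a, b in zip(p1, p2)))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):      # f is convex in lambda, so the minimum is left of m2
            hi = m2
        else:
            lo = m1
    return -f(0.5 * (lo + hi))

p1 = [0.7, 0.2, 0.1]   # hypothetical P_1
p2 = [0.1, 0.3, 0.6]   # hypothetical P_2
lam_star, d_star = chernoff_by_matching(p1, p2)
c = chernoff_by_minimizing(p1, p2)
```

The two routes agree because the derivative of $\log \sum_x P_1^\lambda(x) P_2^{1-\lambda}(x)$ with respect to $\lambda$ vanishes exactly where the two relative entropies match.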

We outline briefly the usual derivation of the Chernoff information
bound. The maximum a posteriori probability decision rule minimizes the
Bayesian probability of error. The decision region $A$ for hypothesis $H_1$
for the maximum a posteriori rule is

$$A = \left\{ \mathbf{x} : \frac{\pi_1 P_1(\mathbf{x})}{\pi_2 P_2(\mathbf{x})} > 1 \right\}, \tag{11.240}$$

the set of outcomes where the a posteriori probability of hypothesis $H_1$ is
greater than the a posteriori probability of hypothesis $H_2$. The probability
of error for this rule is

$$
\begin{aligned}
P_e &= \pi_1 \alpha_n + \pi_2 \beta_n && (11.241) \\
    &= \sum_{A^c} \pi_1 P_1 + \sum_{A} \pi_2 P_2 && (11.242) \\
    &= \sum \min\{\pi_1 P_1, \pi_2 P_2\}. && (11.243)
\end{aligned}
$$


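Now for any two positive numbers $a$ and $b$, we have

$$\min\{a, b\} \leq a^{\lambda} b^{1-\lambda} \quad \text{for all } 0 \leq \lambda \leq 1. \tag{11.244}$$

Using this to continue the chain, we have

$$
\begin{aligned}
P_e &= \sum \min\{\pi_1 P_1, \pi_2 P_2\} && (11.245) \\
    &\leq \sum (\pi_1 P_1)^{\lambda} (\pi_2 P_2)^{1-\lambda} && (11.246) \\
    &\leq \sum P_1^{\lambda} P_2^{1-\lambda}. && (11.247)
\end{aligned}
$$

For a sequence of i.i.d. observations, $P_k(\mathbf{x}) = \prod_{i=1}^n P_k(x_i)$, and

$$
\begin{aligned}
P_e^{(n)} &\leq \sum \pi_1^{\lambda} \pi_2^{1-\lambda} \prod_i P_1^{\lambda}(x_i) P_2^{1-\lambda}(x_i) && (11.248) \\
  &= \pi_1^{\lambda} \pi_2^{1-\lambda} \prod_i \sum_{x_i} P_1^{\lambda}(x_i) P_2^{1-\lambda}(x_i) && (11.249) \\
  &\leq \prod_i \sum_{x_i} P_1^{\lambda} P_2^{1-\lambda} && (11.250) \\
  &= \left( \sum_x P_1^{\lambda} P_2^{1-\lambda} \right)^n, && (11.251)
\end{aligned}
$$

where (11.250) follows since $\pi_1 \leq 1$, $\pi_2 \leq 1$. Hence, we have

$$\frac{1}{n} \log P_e^{(n)} \leq \log \sum_x P_1^{\lambda}(x) P_2^{1-\lambda}(x). \tag{11.252}$$

Since this is true for all $\lambda$, we can take the minimum over $0 \leq \lambda \leq 1$,
resulting in the Chernoff information bound. This proves that the exponent
is no better than $C(P_1, P_2)$. Achievability follows from Theorem 11.9.1.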

Note that the Bayesian error exponent does not depend on the actual
value of $\pi_1$ and $\pi_2$, as long as they are nonzero. Essentially, the effect of
the prior is washed out for large sample sizes. The optimum decision rule
is to choose the hypothesis with the maximum a posteriori probability,
which corresponds to the test

$$\frac{\pi_1 P_1(X_1, X_2, \ldots, X_n)}{\pi_2 P_2(X_1, X_2, \ldots, X_n)} \gtrless 1. \tag{11.253}$$

Taking the log and dividing by $n$, this test can be rewritten as

$$\frac{1}{n} \log \frac{\pi_1}{\pi_2} + \frac{1}{n} \sum_i \log \frac{P_1(X_i)}{P_2(X_i)} \lessgtr 0, \tag{11.254}$$

where the second term tends to $D(P_1 \| P_2)$ or $-D(P_2 \| P_1)$ accordingly as
$P_1$ or $P_2$ is the true distribution. The first term tends to 0, and the effect
of the prior distribution washes out.

Finally, to round off our discussion of large deviation theory and hypothesis
testing, we consider an example of the conditional limit theorem.

Example 11.9.1 Suppose that major league baseball players have a batting
average of 260 with a standard deviation of 15 and suppose that
minor league ballplayers have a batting average of 240 with a standard
deviation of 15. A group of 100 ballplayers from one of the leagues (the
league is chosen at random) are found to have a group batting average
greater than 250 and are therefore judged to be major leaguers. We are
now told that we are mistaken; these players are minor leaguers. What
can we say about the distribution of batting averages among these 100
players? The conditional limit theorem can be used to show that the
distribution of batting averages among these players will have a mean of 250
and a standard deviation of 15. To see this, we abstract the problem as
follows.

Let us consider an example of testing between two Gaussian distributions,
$f_1 = \mathcal{N}(1, \sigma^2)$ and $f_2 = \mathcal{N}(-1, \sigma^2)$, with different means and the
same variance. As discussed in Section 11.7, the likelihood ratio test in
this case is equivalent to comparing the sample mean with a threshold.
The Bayes test is “Accept the hypothesis $f = f_1$ if $\frac{1}{n} \sum_{i=1}^n X_i > 0$.” Now
assume that we make an error of the first kind (we say that $f = f_1$ when
indeed $f = f_2$) in this test. What is the conditional distribution of the
samples given that we have made an error?

We might guess at various possibilities: 

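• The sample will look like a $(\frac{1}{2}, \frac{1}{2})$ mix of the two normal distributions.
Plausible as this is, it is incorrect.

• $X_i \approx 0$ for all $i$. This is quite clearly very unlikely, although it is
conditionally likely that $\bar{X}_n$ is close to 0.

• The correct answer is given by the conditional limit theorem. If the
true distribution is $f_2$ and the sample type is in the set $A$, the
conditional distribution is close to $f^*$, the distribution in $A$ that is closest to
$f_2$. By symmetry, this corresponds to $\lambda = \frac{1}{2}$ in (11.232). Calculating
the distribution, we get

$$
\begin{aligned}
f^*(x) &= \frac{\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-1)^2}{2\sigma^2}} \right)^{\frac{1}{2}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x+1)^2}{2\sigma^2}} \right)^{\frac{1}{2}}}{\int \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-1)^2}{2\sigma^2}} \right)^{\frac{1}{2}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x+1)^2}{2\sigma^2}} \right)^{\frac{1}{2}} dx} && (11.255) \\
  &= \frac{\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2 + 1}{2\sigma^2}}}{\int \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2 + 1}{2\sigma^2}} \, dx} && (11.256) \\
  &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} && (11.257) \\
  &= \mathcal{N}(0, \sigma^2). && (11.258)
\end{aligned}
$$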
It is interesting to note that the conditional distribution is normal with 
mean 0 and with the same variance as the original distributions. This 
is strange but true; if we mistake a normal population for another, the 
“shape” of this population still looks normal with the same variance 
and a different mean. Apparently, this rare event does not result from 
bizarre-looking data. 
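The step from (11.255) to (11.257) can be spot-checked pointwise. In the sketch below the variance $\sigma^2 = 2$ and the grid of test points are arbitrary assumptions; the normalizing integral is taken in closed form as $e^{-1/(2\sigma^2)}$, which is what the denominator of (11.256) evaluates to.

```python
import math

def normal_pdf(x, mean, sigma2):
    """Density of N(mean, sigma2) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def f_star(x, sigma2):
    """Normalized geometric mean of N(1, sigma2) and N(-1, sigma2): lambda = 1/2 in (11.232)."""
    unnormalized = math.sqrt(normal_pdf(x, 1.0, sigma2) * normal_pdf(x, -1.0, sigma2))
    # The geometric mean integrates to exp(-1 / (2 sigma2)); see (11.256).
    return unnormalized / math.exp(-1.0 / (2.0 * sigma2))

sigma2 = 2.0                           # hypothetical common variance
grid = [-3.0, -1.0, 0.0, 0.5, 2.0]     # arbitrary evaluation points
max_gap = max(abs(f_star(x, sigma2) - normal_pdf(x, 0.0, sigma2)) for x in grid)
```

The gap is at the level of floating-point rounding, in line with the conclusion that the conditional distribution is $\mathcal{N}(0, \sigma^2)$.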


Example 11.9.2 (Large deviation theory and football) Consider a very
simple version of football in which the score is directly related to the
number of yards gained. Assume that the coach has a choice between two
strategies: running or passing. Associated with each strategy is a
distribution on the number of yards gained. For example, in general, running
results in a gain of a few yards with very high probability, whereas passing
results in huge gains with low probability. Examples of the distributions
are illustrated in Figure 11.12.

FIGURE 11.12. Distribution of yards gained in a run or a pass play. (Panels: yards gained in pass; yards gained in run. Vertical axis: probability density.)

At the beginning of the game, the coach uses the strategy that promises
the greatest expected gain. Now assume that we are in the closing minutes
of the game and one of the teams is leading by a large margin.
(Let us ignore first downs and adaptable defenses.) So the trailing team
will win only if it is very lucky. If luck is required to win, we might
as well assume that we will be lucky and play accordingly. What is the
appropriate strategy?

Assume that the team has only $n$ plays left and it must gain $l$ yards,
where $l$ is much larger than $n$ times the expected gain under each play. The
probability that the team succeeds in achieving $l$ yards is exponentially
small; hence, we can use the large deviation results and Sanov’s theorem to
calculate the probability of this event. To be precise, we wish to calculate
the probability that $\sum_{i=1}^n Z_i \geq n\alpha$, where $Z_i$ are independent random
variables and $Z_i$ has a distribution corresponding to the strategy chosen.

The situation is illustrated in Figure 11.13. Let $E$ be the set of types
corresponding to the constraint,

$$E = \left\{ P : \sum_{a \in \mathcal{X}} P(a)\, a \geq \alpha \right\}. \tag{11.259}$$

If $P_1$ is the distribution corresponding to passing all the time, the
probability of winning is the probability that the sample type is in $E$, which by
Sanov’s theorem is $2^{-n D(P_1^* \| P_1)}$, where $P_1^*$ is the distribution in $E$ that is
closest to $P_1$. Similarly, if the coach uses the running game all the time,
the probability of winning is $2^{-n D(P_2^* \| P_2)}$. What if he uses a mixture of
strategies? Is it possible that $2^{-n D(P_\lambda^* \| P_\lambda)}$, the probability of winning with
a mixed strategy, $P_\lambda = \lambda P_1 + (1 - \lambda) P_2$, is better than the probability of
winning with either pure passing or pure running? The somewhat surprising
answer is yes, as can be shown by example. This provides a reason
to use a mixed strategy other than the fact that it confuses the defense.

FIGURE 11.13. Probability simplex for a football game.


We end this section with another inequality due to Chernoff, which
is a special version of Markov’s inequality. This inequality is called the
Chernoff bound.

Lemma 11.9.1 Let $Y$ be any random variable and let $\psi(s)$ be the
moment generating function of $Y$,

$$\psi(s) = E e^{sY}. \tag{11.260}$$

Then for all $s \geq 0$,

$$\Pr(Y \geq a) \leq e^{-sa} \psi(s), \tag{11.261}$$

and thus

$$\Pr(Y \geq a) \leq \min_{s \geq 0} e^{-sa} \psi(s). \tag{11.262}$$

Proof: Apply Markov’s inequality to the nonnegative random variable
$e^{sY}$. □
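As an illustration of Lemma 11.9.1, take $Y \sim \mathcal{N}(0, 1)$, for which $\psi(s) = e^{s^2/2}$. The sketch below (the choice of distribution, the grid search and the value $a = 2$ are assumptions made for the example) minimizes $e^{-sa}\psi(s)$ over a grid of $s \geq 0$ and compares the result with the exact Gaussian tail.

```python
import math

def chernoff_bound_std_normal(a, steps=2000, s_max=10.0):
    """Grid-minimize exp(-s a) * psi(s) over s >= 0, with psi(s) = exp(s^2 / 2)."""
    best_s, best = 0.0, 1.0            # s = 0 gives the trivial bound 1
    for i in range(1, steps + 1):
        s = s_max * i / steps
        value = math.exp(-s * a + 0.5 * s * s)
        if value < best:
            best_s, best = s, value
    return best_s, best

def exact_tail_std_normal(a):
    """Pr(Y >= a) for Y ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

a = 2.0
s_opt, bound = chernoff_bound_std_normal(a)
tail = exact_tail_std_normal(a)
```

For this case the minimizing $s$ equals $a$ and the bound is $e^{-a^2/2}$, which is loose by a polynomial factor but has the correct exponent.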

11.10 FISHER INFORMATION AND THE CRAMER-RAO INEQUALITY

A standard problem in statistical estimation is to determine the parameters
of a distribution from a sample of data drawn from that distribution.
For example, let $X_1, X_2, \ldots, X_n$ be drawn i.i.d. $\sim \mathcal{N}(\theta, 1)$. Suppose that
we wish to estimate $\theta$ from a sample of size $n$. There are a number of
functions of the data that we can use to estimate $\theta$. For example, we can
use the first sample $X_1$. Although the expected value of $X_1$ is $\theta$, it is clear
that we can do better by using more of the data. We guess that the best
estimate of $\theta$ is the sample mean $\bar{X}_n = \frac{1}{n} \sum X_i$. Indeed, it can be shown
that $\bar{X}_n$ is the minimum mean-squared-error unbiased estimator.

We begin with a few definitions. Let $\{f(x; \theta)\}$, $\theta \in \Theta$, denote an
indexed family of densities, $f(x; \theta) \geq 0$, $\int f(x; \theta)\, dx = 1$ for all $\theta \in \Theta$.
Here $\Theta$ is called the parameter set.

Definition An estimator for $\theta$ for sample size $n$ is a function
$T : \mathcal{X}^n \to \Theta$.

An estimator is meant to approximate the value of the parameter. It
is therefore desirable to have some idea of the goodness of the approximation.
We will call the difference $T - \theta$ the error of the estimator. The
error is a random variable.

Definition The bias of an estimator $T(X_1, X_2, \ldots, X_n)$ for the parameter
$\theta$ is the expected value of the error of the estimator [i.e., the bias is
$E_\theta T(X_1, X_2, \ldots, X_n) - \theta$]. The subscript $\theta$ means that the expectation is
with respect to the density $f(\cdot; \theta)$. The estimator is said to be unbiased
if the bias is zero for all $\theta \in \Theta$ (i.e., the expected value of the estimator
is equal to the parameter).

Example 11.10.1 Let $X_1, X_2, \ldots, X_n$ drawn i.i.d. $\sim f(x) = (1/\lambda) e^{-x/\lambda}$,
$x \geq 0$ be a sequence of exponentially distributed random variables.
Estimators of $\lambda$ include $X_1$ and $\bar{X}_n$. Both estimators are unbiased.

The bias is the expected value of the error, and the fact that it is 
zero does not guarantee that the error is low with high probability. We 
need to look at some loss function of the error; the most commonly 
chosen loss function is the expected square of the error. A good estimator
should have a low expected squared error and should have an error
that approaches 0 as the sample size goes to infinity. This motivates the
following definition:

Definition An estimator $T(X_1, X_2, \ldots, X_n)$ for $\theta$ is said to be consistent
in probability if

$T(X_1, X_2, \ldots, X_n) \to \theta$ in probability as $n \to \infty$.

Consistency is a desirable asymptotic property, but we are interested in 
the behavior for small sample sizes as well. We can then rank estimators 
on the basis of their mean-squared error. 

Definition An estimator $T_1(X_1, X_2, \ldots, X_n)$ is said to dominate
another estimator $T_2(X_1, X_2, \ldots, X_n)$ if, for all $\theta$,

$$E\left(T_1(X_1, X_2, \ldots, X_n) - \theta\right)^2 \leq E\left(T_2(X_1, X_2, \ldots, X_n) - \theta\right)^2. \tag{11.263}$$

This raises a natural question: Is there a best estimator of $\theta$ that dominates
every other estimator? To answer this question, we derive the
Cramer-Rao lower bound on the mean-squared error of any estimator.
We first define the score function of the distribution $f(x; \theta)$. We then use
the Cauchy-Schwarz inequality to prove the Cramer-Rao lower bound
on the variance of all unbiased estimators.

Definition The score $V$ is a random variable defined by

$$V = \frac{\partial}{\partial \theta} \ln f(X; \theta) = \frac{\frac{\partial}{\partial \theta} f(X; \theta)}{f(X; \theta)}, \tag{11.264}$$

where $X \sim f(x; \theta)$.

The mean value of the score is

$$
\begin{aligned}
EV &= \int \frac{\frac{\partial}{\partial \theta} f(x; \theta)}{f(x; \theta)} f(x; \theta)\, dx && (11.265) \\
   &= \int \frac{\partial}{\partial \theta} f(x; \theta)\, dx && (11.266) \\
   &= \frac{\partial}{\partial \theta} \int f(x; \theta)\, dx && (11.267) \\
   &= \frac{\partial}{\partial \theta} 1 && (11.268) \\
   &= 0, && (11.269)
\end{aligned}
$$

and therefore $EV^2 = \operatorname{var}(V)$. The variance of the score has a special
significance.

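Definition The Fisher information $J(\theta)$ is the variance of the score:

$$J(\theta) = E_\theta \left[ \frac{\partial}{\partial \theta} \ln f(X; \theta) \right]^2. \tag{11.270}$$

If we consider a sample of $n$ random variables $X_1, X_2, \ldots, X_n$ drawn
i.i.d. $\sim f(x; \theta)$, we have

$$f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta), \tag{11.271}$$

and the score function is the sum of the individual score functions,

$$
\begin{aligned}
V(X_1, X_2, \ldots, X_n) &= \frac{\partial}{\partial \theta} \ln f(X_1, X_2, \ldots, X_n; \theta) && (11.272) \\
  &= \sum_{i=1}^n \frac{\partial}{\partial \theta} \ln f(X_i; \theta) && (11.273) \\
  &= \sum_{i=1}^n V(X_i), && (11.274)
\end{aligned}
$$

where the $V(X_i)$ are independent, identically distributed with zero mean.
Hence, the $n$-sample Fisher information is

$$
\begin{aligned}
J_n(\theta) &= E_\theta \left[ \frac{\partial}{\partial \theta} \ln f(X_1, X_2, \ldots, X_n; \theta) \right]^2 && (11.275) \\
  &= E_\theta V^2(X_1, X_2, \ldots, X_n) && (11.276) \\
  &= E_\theta \left( \sum_{i=1}^n V(X_i) \right)^2 && (11.277) \\
  &= \sum_{i=1}^n E_\theta V^2(X_i) && (11.278) \\
  &= n J(\theta). && (11.279)
\end{aligned}
$$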

Consequently, the Fisher information for n i.i.d. samples is n times the 
individual Fisher information. The significance of the Fisher information 
is shown in the following theorem. 
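The additivity $J_n(\theta) = nJ(\theta)$ can be verified by direct enumeration for a simple family. The sketch below uses a Bernoulli($\theta$) probability mass function in place of a density, which is an assumption made so that the expectations become finite sums; the values $\theta = 0.3$ and $n = 12$ are arbitrary.

```python
import math

def score(x, theta):
    """d/dtheta of ln f(x; theta) for f(x; theta) = theta^x (1 - theta)^(1 - x), x in {0, 1}."""
    return x / theta - (1 - x) / (1.0 - theta)

def fisher_single(theta):
    """J(theta) = E[V^2] for one Bernoulli observation, as in (11.270)."""
    return theta * score(1, theta) ** 2 + (1.0 - theta) * score(0, theta) ** 2

def fisher_sample(n, theta):
    """J_n(theta) = E[(sum of scores)^2], enumerated over k = number of ones."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)
        v = k * score(1, theta) + (n - k) * score(0, theta)   # sample score (11.274)
        total += prob * v * v
    return total

theta, n = 0.3, 12
mean_score = theta * score(1, theta) + (1.0 - theta) * score(0, theta)   # should be 0, cf. (11.269)
j1 = fisher_single(theta)
jn = fisher_sample(n, theta)
```

For the Bernoulli family $J(\theta) = 1/(\theta(1-\theta))$, so the enumeration can be compared against a closed form as well as against $nJ(\theta)$.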

Theorem 11.10.1 (Cramer-Rao inequality) The mean-squared error
of any unbiased estimator $T(X)$ of the parameter $\theta$ is lower bounded by
the reciprocal of the Fisher information:

$$\operatorname{var}(T) \geq \frac{1}{J(\theta)}. \tag{11.280}$$

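Proof: Let $V$ be the score function and $T$ be the estimator. By the
Cauchy-Schwarz inequality, we have

$$\left( E_\theta[(V - E_\theta V)(T - E_\theta T)] \right)^2 \leq E_\theta (V - E_\theta V)^2 \, E_\theta (T - E_\theta T)^2. \tag{11.281}$$

Since $T$ is unbiased, $E_\theta T = \theta$ for all $\theta$. By (11.269), $E_\theta V = 0$ and hence
$E_\theta (V - E_\theta V)(T - E_\theta T) = E_\theta (VT)$. Also, by definition, $\operatorname{var}(V) = J(\theta)$.
Substituting these conditions in (11.281), we have

$$[E_\theta (VT)]^2 \leq J(\theta) \operatorname{var}(T). \tag{11.282}$$

Now,

$$
\begin{aligned}
E_\theta (VT) &= \int \frac{\frac{\partial}{\partial \theta} f(x; \theta)}{f(x; \theta)} T(x) f(x; \theta)\, dx && (11.283) \\
  &= \int \frac{\partial}{\partial \theta} f(x; \theta)\, T(x)\, dx && (11.284) \\
  &= \frac{\partial}{\partial \theta} \int f(x; \theta)\, T(x)\, dx && (11.285) \\
  &= \frac{\partial}{\partial \theta} E_\theta T && (11.286) \\
  &= \frac{\partial}{\partial \theta} \theta && (11.287) \\
  &= 1, && (11.288)
\end{aligned}
$$

where the interchange of differentiation and integration in (11.285) can be
justified using the bounded convergence theorem for appropriately well
behaved $f(x; \theta)$, and (11.287) follows from the fact that the estimator $T$
is unbiased. Substituting this in (11.282), we obtain

$$\operatorname{var}(T) \geq \frac{1}{J(\theta)}, \tag{11.289}$$

which is the Cramer-Rao inequality for unbiased estimators. □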


By essentially the same arguments, we can show that for any estimator

$$E(T - \theta)^2 \geq \frac{(1 + b_T'(\theta))^2}{J(\theta)} + b_T^2(\theta), \tag{11.290}$$

where $b_T(\theta) = E_\theta T - \theta$ and $b_T'(\theta)$ is the derivative of $b_T(\theta)$ with respect
to $\theta$. The proof of this is left as a problem at the end of the chapter.

Example 11.10.2 Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim \mathcal{N}(\theta, \sigma^2)$, $\sigma^2$ known.
Here $J(\theta) = n/\sigma^2$. Let $T(X_1, X_2, \ldots, X_n) = \bar{X}_n = \frac{1}{n} \sum X_i$. Then
$E_\theta(\bar{X}_n - \theta)^2 = \sigma^2/n = 1/J(\theta)$. Thus, $\bar{X}_n$ is the minimum variance
unbiased estimator of $\theta$, since it achieves the Cramer-Rao lower bound.
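Example 11.10.2 can be illustrated with a small seeded Monte Carlo run. The sketch below is only a sanity check under assumed values ($\theta = 1.5$, $\sigma = 2$, $n = 10$, 20000 trials): it compares the empirical variance of the sample mean with $\sigma^2/n$, and with the variance of the cruder unbiased estimator $X_1$.

```python
import random
import statistics

def estimator_variances(theta, sigma, n, trials, seed=0):
    """Empirical variances of two unbiased estimators of theta: X_1 and the sample mean."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    first, means = [], []
    for _ in range(trials):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        first.append(xs[0])
        means.append(sum(xs) / n)
    return statistics.pvariance(first), statistics.pvariance(means)

theta, sigma, n = 1.5, 2.0, 10
var_first, var_mean = estimator_variances(theta, sigma, n, trials=20000)
cramer_rao = sigma ** 2 / n            # 1 / J(theta) with J(theta) = n / sigma^2
```

Because the variances are estimated from finitely many trials, `var_mean` can land slightly above or below `cramer_rao`; the bound itself applies to the true variance, not to the Monte Carlo estimate.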

The Cramer-Rao inequality gives us a lower bound on the variance 
for all unbiased estimators. When this bound is achieved, we call the 
estimator efficient. 

Definition An unbiased estimator $T$ is said to be efficient if it meets
the Cramer-Rao bound with equality [i.e., if $\operatorname{var}(T) = \frac{1}{J(\theta)}$].

The Fisher information is therefore a measure of the amount of “information”
about $\theta$ that is present in the data. It gives a lower bound on the
error in estimating $\theta$ from the data. However, it is possible that there does
not exist an estimator meeting this lower bound.

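We can generalize the concept of Fisher information to the multiparameter
case, in which case we define the Fisher information matrix $J(\theta)$
with elements

$$J_{ij}(\theta) = \int f(x; \theta) \frac{\partial}{\partial \theta_i} \ln f(x; \theta) \frac{\partial}{\partial \theta_j} \ln f(x; \theta)\, dx. \tag{11.291}$$

The Cramer-Rao inequality becomes the matrix inequality

$$\Sigma \geq J^{-1}(\theta), \tag{11.292}$$

where $\Sigma$ is the covariance matrix of a set of unbiased estimators for the
parameters $\theta$ and $\Sigma \geq J^{-1}(\theta)$ in the sense that the difference $\Sigma - J^{-1}$ is
a nonnegative definite matrix. We will not go into the details of the proof
for multiple parameters; the basic ideas are similar.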

Is there a relationship between the Fisher information $J(\theta)$ and quantities
such as entropy defined earlier? Note that Fisher information is defined
with respect to a family of parametric distributions, unlike entropy, which
is defined for all distributions. But we can parametrize any distribution
$f(x)$ by a location parameter $\theta$ and define Fisher information with respect
to the family of densities $f(x - \theta)$ under translation. We explore the
relationship in greater detail in Section 17.8, where we show that while
entropy is related to the volume of the typical set, the Fisher information 
is related to the surface area of the typical set. Further relationships of 
Fisher information to relative entropy are developed in the problems. 


SUMMARY 


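Basic identities

$$Q^n(\mathbf{x}) = 2^{-n(D(P_{\mathbf{x}} \| Q) + H(P_{\mathbf{x}}))}, \qquad (11.293)$$

$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}, \qquad (11.294)$$

$$|T(P)| \doteq 2^{nH(P)}, \qquad (11.295)$$

$$Q^n(T(P)) \doteq 2^{-nD(P \| Q)}. \qquad (11.296)$$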
Universal data compression

$$P_e^{(n)} \le 2^{-nD(P_R^* \| Q)} \quad \text{for all } Q, \qquad (11.297)$$

where

$$D(P_R^* \| Q) = \min_{P:\, H(P) \ge R} D(P \| Q). \qquad (11.298)$$

Large deviations (Sanov's theorem)

$$Q^n(E) = Q^n(E \cap \mathcal{P}_n) \le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^* \| Q)}, \qquad (11.299)$$

$$D(P^* \| Q) = \min_{P \in E} D(P \| Q). \qquad (11.300)$$

If $E$ is the closure of its interior, then

$$Q^n(E) \doteq 2^{-nD(P^* \| Q)}. \qquad (11.301)$$

$\mathcal{L}_1$ bound on relative entropy

$$D(P_1 \| P_2) \ge \frac{1}{2 \ln 2}\, \|P_1 - P_2\|_1^2. \qquad (11.302)$$


Pythagorean theorem. If $E$ is a convex set of types, distribution $Q \notin E$,
and $P^*$ achieves $D(P^* \| Q) = \min_{P \in E} D(P \| Q)$, we have

$$D(P \| Q) \ge D(P \| P^*) + D(P^* \| Q) \qquad (11.303)$$

for all $P \in E$.

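Conditional limit theorem. If $X_1, X_2, \ldots, X_n$ i.i.d. $\sim Q$, then

$$\Pr(X_1 = a \mid P_{X^n} \in E) \to P^*(a) \quad \text{in probability}, \qquad (11.304)$$

where $P^*$ minimizes $D(P \| Q)$ over $P \in E$. In particular,

$$\Pr\left\{ X_1 = a \,\middle|\, \frac{1}{n} \sum_{i=1}^{n} X_i \ge \alpha \right\} \to \frac{Q(a) e^{\lambda a}}{\sum_x Q(x) e^{\lambda x}}. \qquad (11.305)$$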


Neyman-Pearson lemma. The optimum test between two densities
$P_1$ and $P_2$ has a decision region of the form "accept $P = P_1$ if

$$\frac{P_1(x_1, x_2, \ldots, x_n)}{P_2(x_1, x_2, \ldots, x_n)} > T.\text{"}$$
Chernoff-Stein lemma. The best achievable error exponent $\beta_n^\epsilon$ if
$\alpha_n \le \epsilon$:

$$\beta_n^\epsilon = \min_{A_n \subseteq \mathcal{X}^n,\ \alpha_n < \epsilon} \beta_n, \qquad (11.306)$$

$$\lim_{n \to \infty} \frac{1}{n} \log \beta_n^\epsilon = -D(P_1 \| P_2). \qquad (11.307)$$

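Chernoff information. The best achievable exponent for a Bayesian
probability of error is

$$D^* = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2), \qquad (11.308)$$

where

$$P_\lambda = \frac{P_1^\lambda(x) P_2^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_1^\lambda(a) P_2^{1-\lambda}(a)} \qquad (11.309)$$

with $\lambda = \lambda^*$ chosen so that

$$D(P_\lambda \| P_1) = D(P_\lambda \| P_2). \qquad (11.310)$$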
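As an illustration of the Chernoff information formulas in this summary, the value of $\lambda^*$ can be located numerically. The following is a minimal sketch under stated assumptions: both distributions are strictly positive on a common finite alphabet, bisection on $\lambda \in [0, 1]$ is our choice of root finder rather than a method prescribed by the text, and the function names are hypothetical.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in bits for pmfs given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tilted(p1, p2, lam):
    """The distribution P_lambda proportional to p1^lam * p2^(1 - lam)."""
    weights = [a**lam * b ** (1 - lam) for a, b in zip(p1, p2)]
    total = sum(weights)
    return [w / total for w in weights]


def chernoff_information(p1, p2, tol=1e-12):
    """Bisection for lambda* with D(P_lam || p1) = D(P_lam || p2).

    Assumes p1 and p2 are strictly positive. At lam = 0, P_lam = p2, so
    D(P_lam || p1) exceeds D(P_lam || p2); the gap decreases as lam grows.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        p_lam = tilted(p1, p2, lam)
        if kl(p_lam, p1) > kl(p_lam, p2):
            lo = lam  # P_lam is still too close to p2; move toward p1
        else:
            hi = lam
    lam = (lo + hi) / 2
    return lam, kl(tilted(p1, p2, lam), p1)


# Illustrative (assumed) pair of distributions on a binary alphabet.
lam_star, chernoff = chernoff_information([0.8, 0.2], [0.2, 0.8])
print("lambda* =", lam_star, " Chernoff information (bits) =", chernoff)
```

For this symmetric pair the tilted distribution at $\lambda^*$ is uniform, so the result can be checked by hand against $-\frac{1}{2}\log_2(4 \cdot 0.8 \cdot 0.2)$.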


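Fisher information

$$J(\theta) = E_\theta \left[ \frac{\partial}{\partial \theta} \ln f(x; \theta) \right]^2. \qquad (11.311)$$

Cramér-Rao inequality. For any unbiased estimator $T$ of $\theta$,

$$E_\theta (T(X) - \theta)^2 = \mathrm{var}(T) \ge \frac{1}{J(\theta)}. \qquad (11.312)$$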


PROBLEMS 

11.1 Chernoff-Stein lemma. Consider the two-hypothesis test
$H_1: f = f_1$ vs. $H_2: f = f_2$.

Find $D(f_1 \| f_2)$ if

(b) $f_i(x) = \lambda_i e^{-\lambda_i x}$, $x \ge 0$, $i = 1, 2$.

(c) $f_1(x)$ is the uniform density over the interval $[0, 1]$ and $f_2(x)$
is the uniform density over $[a, a + 1]$. Assume that $0 < a < 1$.

(d) $f_1$ corresponds to a fair coin and $f_2$ corresponds to a two-headed
coin.


11.2 Relation between $D(P \| Q)$ and chi-square. Show that the $\chi^2$
statistic

$$\chi^2 = \sum_x \frac{(P(x) - Q(x))^2}{Q(x)}$$

is (twice) the first term in the Taylor series expansion of $D(P \| Q)$
about $Q$. Thus, $D(P \| Q) = \frac{1}{2} \chi^2 + \cdots$. [Suggestion: Write
$\frac{P}{Q} = 1 + \frac{P - Q}{Q}$ and expand the log.]

11.3 Error exponent for universal codes. A universal source code of
rate $R$ achieves a probability of error $P_e^{(n)} \doteq e^{-nD(P^* \| Q)}$, where
$Q$ is the true distribution and $P^*$ achieves $\min D(P \| Q)$ over all
$P$ such that $H(P) \ge R$.

(a) Find $P^*$ in terms of $Q$ and $R$.

(b) Now let $X$ be binary. Find the region of source probabilities
$Q(x)$, $x \in \{0, 1\}$, for which rate $R$ is sufficient for the
universal source code to achieve $P_e^{(n)} \to 0$.

11.4 Sequential projection. We wish to show that projecting $Q$ onto
$\mathcal{P}_1$ and then projecting the projection $\hat{Q}$ onto $\mathcal{P}_1 \cap \mathcal{P}_2$ is the same
as projecting $Q$ directly onto $\mathcal{P}_1 \cap \mathcal{P}_2$. Let $\mathcal{P}_1$ be the set of probability
mass functions on $\mathcal{X}$ satisfying

$$\sum_x p(x) = 1, \qquad (11.313)$$

$$\sum_x p(x) h_i(x) \ge \alpha_i, \quad i = 1, 2, \ldots, r. \qquad (11.314)$$

Let $\mathcal{P}_2$ be the set of probability mass functions on $\mathcal{X}$ satisfying

$$\sum_x p(x) = 1, \qquad (11.315)$$

$$\sum_x p(x) g_j(x) \ge \beta_j, \quad j = 1, 2, \ldots, s. \qquad (11.316)$$

Suppose that $Q \notin \mathcal{P}_1 \cup \mathcal{P}_2$. Let $P^*$ minimize $D(P \| Q)$ over all
$P \in \mathcal{P}_1$. Let $R^*$ minimize $D(R \| Q)$ over all $R \in \mathcal{P}_1 \cap \mathcal{P}_2$. Argue
that $R^*$ minimizes $D(R \| P^*)$ over all $R \in \mathcal{P}_1 \cap \mathcal{P}_2$.

11.5 Counting. Let $\mathcal{X} = \{1, 2, \ldots, m\}$. Show that the number of sequences
$x^n \in \mathcal{X}^n$ satisfying $\frac{1}{n} \sum_{i=1}^{n} g(x_i) \ge \alpha$ is approximately
equal to $2^{nH^*}$, to first order in the exponent, for $n$ sufficiently large,
where

$$H^* = \max_{P:\ \sum_{i=1}^{m} P(i) g(i) \ge \alpha} H(P). \qquad (11.317)$$

11.6 Biased estimates may be better. Consider the problem of estimating
$\mu$ and $\sigma^2$ from $n$ samples of data drawn i.i.d. from a
$\mathcal{N}(\mu, \sigma^2)$ distribution.

(a) Show that $\bar{X}_n$ is an unbiased estimator of $\mu$.

(b) Show that the estimator

$$S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \qquad (11.318)$$

is a biased estimator of $\sigma^2$ and the estimator

$$S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \qquad (11.319)$$

is unbiased.

(c) Show that $S_n^2$ has a lower mean-squared error than that of
$S_{n-1}^2$. This illustrates the idea that a biased estimator may be
"better" than an unbiased estimator.
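Before attempting the proof in part (c), it can help to see the effect numerically. The following is a small Monte Carlo sketch, not a solution to the problem: the function name, seed, and the choice $n = 5$, $\sigma = 1$ are assumptions made for illustration. It estimates the mean-squared error of both variance estimators on the same simulated samples.

```python
import random


def mse_of_variance_estimators(n, sigma, trials, seed=0):
    """Monte Carlo mean-squared errors of S_n^2 (divide by n) and S_{n-1}^2 (divide by n-1)."""
    rng = random.Random(seed)
    true_var = sigma**2
    se_biased = 0.0
    se_unbiased = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        sum_sq = sum((x - mean) ** 2 for x in xs)
        se_biased += (sum_sq / n - true_var) ** 2
        se_unbiased += (sum_sq / (n - 1) - true_var) ** 2
    return se_biased / trials, se_unbiased / trials


# Illustrative (assumed) settings: small n makes the gap easy to see.
mse_biased, mse_unbiased = mse_of_variance_estimators(n=5, sigma=1.0, trials=20000)
print("MSE of S_n^2     :", mse_biased)
print("MSE of S_{n-1}^2 :", mse_unbiased)
```

For normal data the exact values are $(2n-1)\sigma^4/n^2$ and $2\sigma^4/(n-1)$, so with $n = 5$ and $\sigma = 1$ the simulation should land near 0.36 and 0.5, respectively.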

11.7 Fisher information and relative entropy. Show for a parametric
family $\{p_\theta(x)\}$ that

$$\lim_{\theta' \to \theta} \frac{1}{(\theta - \theta')^2} D(p_\theta \| p_{\theta'}) = \frac{1}{\ln 4} J(\theta). \qquad (11.320)$$

11.8 Examples of Fisher information. The Fisher information $J(\theta)$
for the family $f_\theta(x)$, $\theta \in \mathbb{R}$ is defined by

$$J(\theta) = E_\theta \left( \frac{\partial f_\theta(X) / \partial \theta}{f_\theta(X)} \right)^2 = \int \frac{(f_\theta')^2}{f_\theta}.$$

Find the Fisher information for the following families:

(a) $f_\theta(x) = \mathcal{N}(0, \theta) = \frac{1}{\sqrt{2\pi\theta}} e^{-\frac{x^2}{2\theta}}$

(b) $f_\theta(x) = \theta e^{-\theta x}$, $x \ge 0$

(c) What is the Cramér-Rao lower bound on $E_\theta(\hat{\theta}(X) - \theta)^2$,
where $\hat{\theta}(X)$ is an unbiased estimator of $\theta$ for parts (a) and
(b)?

11.9 Two conditionally independent looks double the Fisher information.
Let $g_\theta(x_1, x_2) = f_\theta(x_1) f_\theta(x_2)$. Show that $J_g(\theta) = 2 J_f(\theta)$.

11.10 Joint distributions and product distributions. Consider a joint
distribution $Q(x, y)$ with marginals $Q(x)$ and $Q(y)$. Let $E$ be
the set of types that look jointly typical with respect to $Q$:

$$E = \Big\{ P(x, y) : -\sum_{x,y} P(x, y) \log Q(x) - H(X) = 0,$$
$$-\sum_{x,y} P(x, y) \log Q(y) - H(Y) = 0,$$
$$-\sum_{x,y} P(x, y) \log Q(x, y) - H(X, Y) = 0 \Big\}. \qquad (11.321)$$

(a) Let $Q_0(x, y)$ be another distribution on $\mathcal{X} \times \mathcal{Y}$. Argue
that the distribution $P^*$ in $E$ that is closest to $Q_0$ is of the
form

$$P^*(x, y) = Q_0(x, y)\, e^{\lambda_0 + \lambda_1 \log Q(x) + \lambda_2 \log Q(y) + \lambda_3 \log Q(x, y)}, \qquad (11.322)$$

where $\lambda_0$, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are chosen to satisfy the constraints.
Argue that this distribution is unique.

(b) Now let $Q_0(x, y) = Q(x) Q(y)$. Verify that $Q(x, y)$ is of the
form (11.322) and satisfies the constraints. Thus, $P^*(x, y) =
Q(x, y)$ (i.e., the distribution in $E$ closest to the product
distribution is the joint distribution).

11.11 Cramér-Rao inequality with a bias term. Let $X \sim f(x; \theta)$ and
let $T(X)$ be an estimator for $\theta$. Let $b_T(\theta) = E_\theta T - \theta$ be the bias
of the estimator. Show that

$$E(T - \theta)^2 \ge \frac{[1 + b_T'(\theta)]^2}{J(\theta)} + b_T^2(\theta). \qquad (11.323)$$
11.12 Hypothesis testing. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim p(x)$. Consider
the hypothesis test $H_1: p = p_1$ vs. $H_2: p = p_2$. Let


and 


Pi(x )= 


x = 

5, x = 0 
j, x = 1 


灼⑴ 


x =— 
x = 0 

X = l. 


Find the error exponent for $\Pr\{\text{Decide } H_2 \mid H_1 \text{ true}\}$ in the best
hypothesis test of $H_1$ vs. $H_2$ subject to $\Pr\{\text{Decide } H_1 \mid H_2 \text{ true}\}$



11.13 Sanov's theorem. Prove a simple version of Sanov's theorem for
Bernoulli($q$) random variables.

Let the proportion of 1's in the sequence $X_1, X_2, \ldots, X_n$ be

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \qquad (11.324)$$

By the law of large numbers, we would expect $\bar{X}_n$ to be close
to $q$ for large $n$. Sanov's theorem deals with the probability that
$p_{X^n}$ is far away from $q$. In particular, for concreteness, if we take
p > q > Sanov’s theorem states that 

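$$-\frac{1}{n} \log \Pr\{(X_1, X_2, \ldots, X_n) : \bar{X}_n \ge p\} \to p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
$$= D((p, 1 - p) \| (q, 1 - q)). \qquad (11.325)$$

Justify the following steps:

$$\Pr\{(X_1, X_2, \ldots, X_n) : \bar{X}_n \ge p\} \le \sum_{i = \lfloor np \rfloor}^{n} \binom{n}{i} q^i (1 - q)^{n - i}. \qquad (11.326)$$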





• Argue that the term corresponding to $i = \lfloor np \rfloor$ is the largest
term in the sum on the right-hand side of the last equation.

• Show that this term is approximately $2^{-nD}$.

• Prove an upper bound on the probability in Sanov's theorem
using the steps above. Use similar arguments to prove a lower
bound and complete the proof of Sanov's theorem.

11.14 Sanov. Let $X_i$ be i.i.d. $\sim \mathcal{N}(0, \sigma^2)$.

(a) Find the exponent in the behavior of $\Pr\{\frac{1}{n} \sum_{i=1}^{n} X_i^2 \ge \alpha^2\}$.
This can be done from first principles (since the normal distribution
is nice) or by using Sanov's theorem.

(b) What do the data look like if $\frac{1}{n} \sum_{i=1}^{n} X_i^2 \ge \alpha^2$? That is, what
is the $P^*$ that minimizes $D(P \| Q)$?

11.15 Counting states. Suppose that an atom is equally likely to be in
each of six states, $X \in \{s_1, s_2, s_3, \ldots, s_6\}$. One observes $n$ atoms
$X_1, X_2, \ldots, X_n$ independently drawn according to this uniform
distribution. It is observed that the frequency of occurrence of
state $s_1$ is twice the frequency of occurrence of state $s_2$.

(a) To first order in the exponent, what is the probability of
observing this event?

(b) Assuming $n$ large, find the conditional distribution of the state
of the first atom $X_1$, given this observation.

11.16 Hypothesis testing. Let $\{X_i\}$ be i.i.d. $\sim p(x)$, $x \in \{1, 2, \ldots\}$.
Consider two hypotheses, $H_0: p(x) = p_0(x)$ vs. $H_1: p(x) =
p_1(x)$, where $p_0(x) = \left(\frac{1}{2}\right)^x$ and $p_1(x) = q p^{x-1}$, $x = 1, 2, 3, \ldots$.

(a) Find $D(p_0 \| p_1)$.

(b) Let $\Pr\{H_0\} = \frac{1}{2}$. Find the minimal probability of error test for
$H_0$ vs. $H_1$ given data $X_1, X_2, \ldots, X_n \sim p(x)$.

11.17 Maximum likelihood estimation. Let $\{f_\theta(x)\}$ denote a parametric
family of densities with parameter $\theta \in \mathbb{R}$. Let $X_1, X_2, \ldots, X_n$ be
i.i.d. $\sim f_\theta(x)$. The function

$$l_\theta(x^n) = \ln \left( \prod_{i=1}^{n} f_\theta(x_i) \right)$$

is known as the log likelihood function. Let $\theta_0$ denote the true
parameter value.

(a) Let the expected log likelihood be

$$E_{\theta_0} l_\theta(X^n) = \int \left( \ln \prod_{i=1}^{n} f_\theta(x_i) \right) \prod_{i=1}^{n} f_{\theta_0}(x_i)\, dx^n,$$

and show that

$$E_{\theta_0}(l(X^n)) = \left( -h(f_{\theta_0}) - D(f_{\theta_0} \| f_\theta) \right) n.$$

(b) Show that the maximum over $\theta$ of the expected log likelihood
is achieved by $\theta = \theta_0$.

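11.18 Large deviations. Let $X_1, X_2, \ldots$ be i.i.d. random variables
drawn according to the geometric distribution

$$\Pr\{X = k\} = p^{k-1}(1 - p), \quad k = 1, 2, \ldots.$$

Find good estimates (to first order in the exponent) of:

(a) $\Pr\{\frac{1}{n} \sum_{i=1}^{n} X_i \ge \alpha\}$.

(b) $\Pr\{X_1 = k \mid \frac{1}{n} \sum_{i=1}^{n} X_i \ge \alpha\}$.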

(c) Evaluate parts (a) and (b) for p = a = 4. 

11.19 Another expression for Fisher information. Use integration by
parts to show that

$$J(\theta) = -E \frac{\partial^2 \ln f_\theta(x)}{\partial \theta^2}.$$


11.20 Stirling’s approximation . Derive a weak form of Stirling’s 
approximation for factorials; that is, show that 




(11.327) 


using the approximation of integrals by sums. Justify the following 
steps: 


n—l 

ln(n!) = ln(/) + ln(n) < 

/ =2 



\nx dx + lnn = 


(11.328) 
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and 


ln(n!) = ^ln(0 


> / \nx dx 

Jo 


(11.329) 


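11.21 Asymptotic value of $\binom{n}{k}$. Use the simple approximation of
Problem 11.20 to show that if $0 \le p \le 1$, and $k = \lfloor np \rfloor$ (i.e., $k$ is the
largest integer less than or equal to $np$), then

$$\lim_{n \to \infty} \frac{1}{n} \log \binom{n}{k} = -p \log p - (1 - p) \log(1 - p) = H(p). \qquad (11.330)$$

Now let $p_i$, $i = 1, \ldots, m$ be a probability distribution on $m$ symbols
(i.e., $p_i \ge 0$ and $\sum_i p_i = 1$). What is the limiting value of

$$\frac{1}{n} \log \binom{n}{\lfloor np_1 \rfloor\ \lfloor np_2 \rfloor\ \cdots\ \lfloor np_{m-1} \rfloor\ \ n - \sum_{j=1}^{m-1} \lfloor np_j \rfloor}$$
$$= \frac{1}{n} \log \frac{n!}{\lfloor np_1 \rfloor!\, \lfloor np_2 \rfloor! \cdots \lfloor np_{m-1} \rfloor!\, \left(n - \sum_{j=1}^{m-1} \lfloor np_j \rfloor\right)!}\ ? \qquad (11.331)$$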


11.22 Running difference. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q_1(x)$, and
$Y_1, Y_2, \ldots, Y_n$ be i.i.d. $\sim Q_2(y)$. Let $X^n$ and $Y^n$ be independent.
Find an expression for $\Pr\{\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i \ge nt\}$ good to first
order in the exponent. Again, this answer can be left in parametric
form.

11.23 Large likelihoods. Let $X_1, X_2, \ldots$ be i.i.d. $\sim Q(x)$, $x \in \{1, 2,
\ldots, m\}$. Let $P(x)$ be some other probability mass function. We
form the log likelihood ratio

$$\frac{1}{n} \log \frac{P^n(X_1, X_2, \ldots, X_n)}{Q^n(X_1, X_2, \ldots, X_n)} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{P(X_i)}{Q(X_i)}$$

of the sequence $X^n$ and ask for the probability that it exceeds a
certain threshold. Specifically, find (to first order in the exponent)




There may be an undetermined parameter in the answer. 
11.24 Fisher information for mixtures. Let $f_1(x)$ and $f_0(x)$ be two
given probability densities. Let $Z$ be Bernoulli($\theta$), where $\theta$ is
unknown. Let $X \sim f_1(x)$ if $Z = 1$ and $X \sim f_0(x)$ if $Z = 0$.

(a) Find the density $f_\theta(x)$ of the observed $X$.

(b) Find the Fisher information $J(\theta)$.

(c) What is the Cramér-Rao lower bound on the mean-squared
error of an unbiased estimate of $\theta$?

(d) Can you exhibit an unbiased estimator of $\theta$?

11.25 Bent coins. Let $\{X_i\}$ be i.i.d. $\sim Q$, where

$$Q(k) = \Pr(X_i = k) = \binom{m}{k} q^k (1 - q)^{m-k} \quad \text{for } k = 0, 1, 2, \ldots, m.$$

Thus, the $X_i$'s are i.i.d. $\sim$ Binomial($m$, $q$). Show that as $n \to \infty$,


Pr 



=k 




where $P^*$ is Binomial($m$, $\lambda$) [i.e., $P^*(k) = \binom{m}{k} \lambda^k (1 - \lambda)^{m-k}$ for
some $\lambda \in [0, 1]$]. Find $\lambda$.

11.26 Conditional limiting distribution



(a) Find the exact value of 


Pr 




(11.332) 


if X\, X 2 ,..., are Bernoulli(|) and n is a multiple of 4. 

(b) Now let $X_i \in \{-1, 0, 1\}$ and let $X_1, X_2, \ldots$ be i.i.d. uniform
over $\{-1, 0, +1\}$. Find the limit of


Pr 




(11.333) 


for $n = 2k$, $k \to \infty$.
11.27 Variational inequality. Verify for positive random variables $X$
that

$$\log E_P(X) = \sup_Q \left[ E_Q(\log X) - D(Q \| P) \right], \qquad (11.334)$$

where $E_P(X) = \sum_x x P(x)$ and $D(Q \| P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$,
and the supremum is over all $Q(x) \ge 0$, $\sum_x Q(x) = 1$.
It is enough to extremize $J(Q) = E_Q \ln X - D(Q \| P) +
\lambda \left( \sum_x Q(x) - 1 \right)$.

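11.28 Type constraints

(a) Find constraints on the type $P_{X^n}$ such that the sample variance
$\overline{X_n^2} - (\bar{X}_n)^2 \le \alpha$, where $\overline{X_n^2} = \frac{1}{n} \sum_{i=1}^{n} X_i^2$ and $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.

(b) Find the exponent in the probability $Q^n(\overline{X_n^2} - (\bar{X}_n)^2 \le \alpha)$.
You can leave the answer in parametric form.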

11.29 Uniform distribution on the simplex. Which of these methods
will generate a sample from the uniform distribution on the simplex
$\{x \in \mathbb{R}^n : x_i \ge 0, \sum_{i=1}^{n} x_i = 1\}$?

(a) Let $Y_i$ be i.i.d. uniform $[0, 1]$ with $X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(b) Let $Y_i$ be i.i.d. exponentially distributed $\sim \lambda e^{-\lambda y}$, $y \ge 0$, with
$X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(c) (Break stick into $n$ parts) Let $Y_1, Y_2, \ldots, Y_{n-1}$ be i.i.d. uniform
$[0, 1]$, and let $X_i$ be the length of the $i$th interval.

HISTORICAL NOTES 

The method of types evolved from notions of strong typicality; some
of the ideas were used by Wolfowitz [566] to prove channel capacity
theorems. The method was fully developed by Csiszár and Körner [149],
who derived the main theorems of information theory from this viewpoint.
The method of types described in Section 11.1 follows the development
in Csiszár and Körner. The $\mathcal{L}_1$ lower bound on relative entropy is due to
Csiszár [138], Kullback [336], and Kemperman [309]. Sanov's theorem
[455] was generalized by Csiszár [141] using the method of types.


CHAPTER 12 

MAXIMUM ENTROPY 


The temperature of a gas corresponds to the average kinetic energy of the
molecules in the gas. What can we say about the distribution of velocities
in the gas at a given temperature? We know from physics that this
distribution is the maximum entropy distribution under the temperature
constraint, otherwise known as the Maxwell-Boltzmann distribution. The
maximum entropy distribution corresponds to the macrostate (as indexed
by the empirical distribution) that has the most microstates (the individual
gas velocities). Implicit in the use of maximum entropy methods in physics
is a sort of AEP which says that all microstates are equally probable.


12.1 MAXIMUM ENTROPY DISTRIBUTIONS 

Consider the following problem: Maximize the entropy $h(f)$ over all
probability densities $f$ satisfying

1. $f(x) \ge 0$, with equality outside the support set $S$

2. $\int_S f(x)\, dx = 1$

3. $\int_S f(x) r_i(x)\, dx = \alpha_i$ for $1 \le i \le m$. (12.1)

Thus, $f$ is a density on support set $S$ meeting certain moment constraints
$\alpha_1, \alpha_2, \ldots, \alpha_m$.

Approach 1 (Calculus) The differential entropy $h(f)$ is a concave
function over a convex set. We form the functional

$$J(f) = -\int f \ln f + \lambda_0 \int f + \sum_{i=1}^{m} \lambda_i \int f r_i \qquad (12.2)$$

and "differentiate" with respect to $f(x)$, the $x$th component of $f$, to obtain

$$\frac{\partial J}{\partial f(x)} = -\ln f(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x). \qquad (12.3)$$

Setting this equal to zero, we obtain the form of the maximizing density

$$f(x) = e^{\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i r_i(x)}, \quad x \in S, \qquad (12.4)$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that $f$ satisfies the constraints.

The approach using calculus only suggests the form of the density that
maximizes the entropy. To prove that this is indeed the maximum, we can
take the second variation. It is simpler to use the information inequality
$D(g \| f) \ge 0$.

Approach 2 (Information inequality) If $g$ satisfies (12.1) and if $f^*$ is
of the form (12.4), then $0 \le D(g \| f^*) = -h(g) + h(f^*)$. Thus $h(g) \le
h(f^*)$ for all $g$ satisfying the constraints. We prove this in the following
theorem.


Theorem 12.1.1 (Maximum entropy distribution) Let $f^*(x) = f_\lambda(x)
= e^{\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$, $x \in S$, where $\lambda_0, \ldots, \lambda_m$ are chosen so that $f^*$ satisfies
(12.1). Then $f^*$ uniquely maximizes $h(f)$ over all probability densities $f$
satisfying constraints (12.1).


Proof: Let $g$ satisfy the constraints (12.1). Then

$$h(g) = -\int_S g \ln g \qquad (12.5)$$
$$= -\int_S g \ln \frac{g}{f^*} f^* \qquad (12.6)$$
$$= -D(g \| f^*) - \int_S g \ln f^* \qquad (12.7)$$
$$\overset{(a)}{\le} -\int_S g \ln f^* \qquad (12.8)$$
$$\overset{(b)}{=} -\int_S g \left( \lambda_0 + \sum_i \lambda_i r_i \right) \qquad (12.9)$$
$$\overset{(c)}{=} -\int_S f^* \left( \lambda_0 + \sum_i \lambda_i r_i \right) \qquad (12.10)$$
$$= -\int_S f^* \ln f^* \qquad (12.11)$$
$$= h(f^*), \qquad (12.12)$$

where (a) follows from the nonnegativity of relative entropy, (b) follows
from the definition of $f^*$, and (c) follows from the fact that both $f^*$
and $g$ satisfy the constraints. Note that equality holds in (a) if and only
if $g(x) = f^*(x)$ for all $x$, except for a set of measure 0, thus proving
uniqueness. □

The same approach holds for discrete entropies and for multivariate 
distributions. 


12.2 EXAMPLES 


Example 12.2.1 (One-dimensional gas with a temperature constraint)
Let the constraints be $EX = 0$ and $EX^2 = \sigma^2$. Then the form of the
maximizing distribution is

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}. \qquad (12.13)$$

To find the appropriate constants, we first recognize that this distribution
has the same form as a normal distribution. Hence, the density that satisfies
the constraints and also maximizes the entropy is the $\mathcal{N}(0, \sigma^2)$ distribution:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}. \qquad (12.14)$$
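To make Example 12.2.1 concrete, the following sketch compares the differential entropy of the normal density with two other zero-mean densities that satisfy the same second-moment constraint. The comparison densities (uniform and Laplace) and the function names are our own choices for illustration; the closed-form entropies used are the standard ones, in nats.

```python
import math


def h_gaussian(var):
    """Differential entropy (nats) of N(0, var)."""
    return 0.5 * math.log(2 * math.pi * math.e * var)


def h_uniform(var):
    """Differential entropy (nats) of the zero-mean uniform density with variance var."""
    width = math.sqrt(12 * var)  # a uniform density on an interval of this width has variance var
    return math.log(width)


def h_laplace(var):
    """Differential entropy (nats) of the zero-mean Laplace density with variance var."""
    scale = math.sqrt(var / 2)  # Laplace variance is 2 * scale^2
    return 1 + math.log(2 * scale)


variance = 1.0  # assumed value of sigma^2 for the illustration
for name, h in [("normal", h_gaussian), ("Laplace", h_laplace), ("uniform", h_uniform)]:
    print(f"{name:8s} h = {h(variance):.4f} nats")
```

All three densities meet the constraints $EX = 0$ and $EX^2 = \sigma^2$, and the normal has the largest entropy of the three, as Theorem 12.1.1 requires.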

Example 12.2.2 (Dice, no constraints) Let $S = \{1, 2, 3, 4, 5, 6\}$. The
distribution that maximizes the entropy is the uniform distribution, $p(x) =
\frac{1}{6}$ for $x \in S$.

Example 12.2.3 (Dice, with $EX = \sum i p_i = \alpha$) This important example
was used by Boltzmann. Suppose that $n$ dice are thrown on the table
and we are told that the total number of spots showing is $n\alpha$. What
proportion of the dice are showing face $i$, $i = 1, 2, \ldots, 6$?

One way of going about this is to count the number of ways that
$n$ dice can fall so that $n_i$ dice show face $i$. There are $\binom{n}{n_1, n_2, \ldots, n_6}$
such ways. This is a macrostate indexed by $(n_1, n_2, \ldots, n_6)$ corresponding
to $\binom{n}{n_1, n_2, \ldots, n_6}$ microstates, each having probability $\frac{1}{6^n}$. To find the
most probable macrostate, we wish to maximize $\binom{n}{n_1, n_2, \ldots, n_6}$ under
the constraint observed on the total number of spots,

$$\sum_{i=1}^{6} i n_i = n\alpha. \qquad (12.15)$$

Using a crude Stirling's approximation, $n! \approx \left(\frac{n}{e}\right)^n$, we find that

$$\binom{n}{n_1, n_2, \ldots, n_6} \approx \frac{\left(\frac{n}{e}\right)^n}{\prod_{i=1}^{6} \left(\frac{n_i}{e}\right)^{n_i}} \qquad (12.16)$$
$$= \prod_{i=1}^{6} \left(\frac{n}{n_i}\right)^{n_i} \qquad (12.17)$$
$$= e^{nH\left(\frac{n_1}{n}, \frac{n_2}{n}, \ldots, \frac{n_6}{n}\right)}. \qquad (12.18)$$

Thus, maximizing $\binom{n}{n_1, n_2, \ldots, n_6}$ under the constraint (12.15) is almost
equivalent to maximizing $H(p_1, p_2, \ldots, p_6)$ under the constraint $\sum i p_i =
\alpha$. Using Theorem 12.1.1 under this constraint, we find the maximum
entropy probability mass function to be

$$p_i^* = \frac{e^{\lambda i}}{\sum_{i=1}^{6} e^{\lambda i}}, \qquad (12.19)$$

where $\lambda$ is chosen so that $\sum i p_i^* = \alpha$. Thus, the most probable macrostate
is $(np_1^*, np_2^*, \ldots, np_6^*)$, and we expect to find $n_i^* = np_i^*$ dice showing
face $i$.
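The parameter $\lambda$ in the dice example has no closed form, but it is easy to find numerically because the mean $\sum i p_i^*$ is increasing in $\lambda$. The following is a minimal sketch under stated assumptions: bisection is our choice of solver, the search interval $[-50, 50]$ and the function name are hypothetical, and the target mean must lie strictly between 1 and 6.

```python
import math


def maxent_dice(alpha, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Return (lam, p) with p_i proportional to exp(lam * i) and mean sum(i * p_i) = alpha.

    Assumes min(faces) < alpha < max(faces); otherwise no finite lam exists.
    """

    def dist(lam):
        # Subtract the largest exponent before exponentiating, for numerical stability.
        top = max(lam * i for i in faces)
        weights = [math.exp(lam * i - top) for i in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(i * p for i, p in zip(faces, dist(lam)))

    lo, hi = -50.0, 50.0  # assumed bracket; the mean is increasing in lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < alpha:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return lam, dist(lam)


lam_fair, p_fair = maxent_dice(3.5)      # no tilt is needed: uniform distribution
lam_heavy, p_heavy = maxent_dice(4.5)    # mean above 3.5 tilts mass toward high faces
print("alpha = 3.5:", lam_fair, p_fair)
print("alpha = 4.5:", lam_heavy, p_heavy)
```

With $\alpha = 3.5$ the solution is $\lambda = 0$ and the uniform distribution of Example 12.2.2; for $\alpha > 3.5$ the probabilities increase geometrically with the face value.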


In Chapter 11 we show that the reasoning and the approximations are
essentially correct. In fact, we show that not only is the maximum entropy
macrostate the most likely, but it also contains almost all of the probability.
Specifically, for rational $\alpha$,

$$\Pr\left\{ \left| \frac{N_i}{n} - p_i^* \right| < \epsilon,\ i = 1, 2, \ldots, 6 \,\middle|\, \sum_{i=1}^{n} X_i = n\alpha \right\} \to 1, \qquad (12.20)$$

as $n \to \infty$ along the subsequence such that $n\alpha$ is an integer.



Example 12.2.4 Let S = [a, b], with no other constraints. Then the 
maximum entropy distribution is the uniform distribution over this range. 

Example 12.2.5 $S = [0, \infty)$ and $EX = \mu$. Then the entropy-maximizing
distribution is

$$f(x) = \frac{1}{\mu} e^{-\frac{x}{\mu}}, \quad x \ge 0. \qquad (12.21)$$

This problem has a physical interpretation. Consider the distribution of the
height $X$ of molecules in the atmosphere. The average potential energy of
the molecules is fixed, and the gas tends to the distribution that has the
maximum entropy subject to the constraint that $E(mgX)$ is fixed. This
is the exponential distribution with density $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$. The
density of the atmosphere does indeed have this distribution.
Example 12.2.6 $S = (-\infty, \infty)$, and $EX = \mu$. Here the maximum entropy
is infinite, and there is no maximum entropy distribution. (Consider
normal distributions with larger and larger variances.)

Example 12.2.7 $S = (-\infty, \infty)$, $EX = \alpha_1$, and $EX^2 = \alpha_2$. The maximum
entropy distribution is $\mathcal{N}(\alpha_1, \alpha_2 - \alpha_1^2)$.

Example 12.2.8 $S = \mathbb{R}^n$, $EX_iX_j = K_{ij}$, $1 \le i, j \le n$. This is a multivariate
example, but the same analysis holds and the maximum entropy
density is of the form

$$f(\mathbf{x}) = e^{\lambda_0 + \sum_{i,j} \lambda_{ij} x_i x_j}. \qquad (12.22)$$

Since the exponent is a quadratic form, it is clear by inspection that the
density is a multivariate normal with zero mean. Since we have to satisfy
the second moment constraints, we must have a multivariate normal with
covariance $K_{ij}$, and hence the density is

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n |K|^{1/2}}\, e^{-\frac{1}{2} \mathbf{x}^T K^{-1} \mathbf{x}}, \qquad (12.23)$$

which has an entropy

$$h(\mathcal{N}_n(0, K)) = \frac{1}{2} \log (2\pi e)^n |K|, \qquad (12.24)$$

as derived in Chapter 8.

Example 12.2.9 Suppose that we have the same constraints as in Example
12.2.8, but $EX_iX_j = K_{ij}$ only for some restricted set of $(i, j) \in A$.
For example, we might know only $K_{ij}$ for $i = j \pm 2$. Then by comparing
(12.22) and (12.23), we can conclude that $(K^{-1})_{ij} = 0$ for $(i, j) \in A^c$
(i.e., the entries in the inverse of the covariance matrix are 0 when $(i, j)$
is outside the constraint set).

12.3 ANOMALOUS MAXIMUM ENTROPY PROBLEM


We have proved that the maximum entropy distribution subject to the
constraints

$$\int_S h_i(x) f(x)\, dx = \alpha_i \qquad (12.25)$$

is of the form

$$f(x) = e^{\lambda_0 + \sum_i \lambda_i h_i(x)} \qquad (12.26)$$

if $\lambda_0, \lambda_1, \ldots, \lambda_p$ satisfying the constraints (12.25) exist.





We now consider a tricky problem in which the $\lambda_i$ cannot be chosen
to satisfy the constraints. Nonetheless, the "maximum" entropy can be
found. We consider the following problem: Maximize the entropy subject
to the constraints

$$\int_{-\infty}^{\infty} f(x)\, dx = 1, \qquad (12.27)$$

$$\int_{-\infty}^{\infty} x f(x)\, dx = \alpha_1, \qquad (12.28)$$

$$\int_{-\infty}^{\infty} x^2 f(x)\, dx = \alpha_2, \qquad (12.29)$$

$$\int_{-\infty}^{\infty} x^3 f(x)\, dx = \alpha_3. \qquad (12.30)$$
Here, the maximum entropy distribution, if it exists, must be of the form

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3}. \qquad (12.31)$$

But if $\lambda_3$ is nonzero, $\int_{-\infty}^{\infty} f = \infty$ and the density cannot be normalized.
So $\lambda_3$ must be 0. But then we have four equations and only three variables,
so that in general it is not possible to choose the appropriate constants.
The method seems to have failed in this case.

The reason for the apparent failure is simple: The entropy has a least
upper bound under these constraints, but it is not possible to attain it.
Consider the corresponding problem with only first and second moment
constraints. In this case, the results of Example 12.2.1 show that the
entropy-maximizing distribution is the normal with the appropriate moments. With
the additional third moment constraint, the maximum entropy cannot be
higher. Is it possible to achieve this value?

We cannot achieve it, but we can come arbitrarily close. Consider a 
normal distribution with a small “wiggle” at a very high value of x. The 
moments of the new distribution are almost the same as those of the old 
one, the biggest change being in the third moment. We can bring the 
first and second moments back to their original values by adding new 
wiggles to balance out the changes caused by the first. By choosing the 
position of the wiggles, we can get any value of the third moment without 
reducing the entropy significantly below that of the associated normal. 
Using this method, we can come arbitrarily close to the upper bound for 
the maximum entropy distribution. We conclude that 

$$\sup h(f) = h(\mathcal{N}(0, \alpha_2 - \alpha_1^2)) = \frac{1}{2} \ln 2\pi e (\alpha_2 - \alpha_1^2). \qquad (12.32)$$

This example shows that the maximum entropy may only be $\epsilon$-achievable.

12.4 SPECTRUM ESTIMATION 

Given a stationary zero-mean stochastic process $\{X_i\}$, we define the
autocorrelation function as

$$R(k) = E X_i X_{i+k}. \qquad (12.33)$$

The Fourier transform of the autocorrelation function for a zero-mean
process is the power spectral density $S(\lambda)$:

$$S(\lambda) = \sum_{m=-\infty}^{\infty} R(m) e^{-im\lambda}, \quad -\pi < \lambda \le \pi, \qquad (12.34)$$

where $i = \sqrt{-1}$. Since the power spectral density is indicative of the
structure of the process, it is useful to form an estimate from a sample of
the process.

There are many methods to estimate the power spectrum. The simplest
way is to estimate the autocorrelation function by taking sample averages
for a sample of length $n$,

$$\hat{R}(k) = \frac{1}{n-k} \sum_{i=1}^{n-k} X_i X_{i+k}. \qquad (12.35)$$

If we use all the values of the sample correlation function $\hat{R}(\cdot)$ to calculate
the spectrum, the estimate that we obtain from (12.34) does not
converge to the true power spectrum for large $n$. Hence, this method, the
periodogram method, is rarely used. One of the reasons for the problem
with the periodogram method is that the estimates of the autocorrelation 
function from the data have different accuracies. The estimates for low 
values of k (called the lags) are based on a large number of samples and 
those for high k on very few samples. So the estimates are more accurate 
at low k. The method can be modified so that it depends only on the 
autocorrelations at low k by setting the higher lag autocorrelations to 0. 
However, this introduces some artifacts because of the sudden transition to 
zero autocorrelation. Various windowing schemes have been suggested to 
smooth out the transition. However, windowing reduces spectral resolution 
and can give rise to negative power spectral estimates. 
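The sample-average estimate (12.35) is simple to compute, and writing it out makes the accuracy issue visible: lag $k$ averages only $n - k$ products. The following is a minimal sketch; the function name and the toy input are assumptions for illustration.

```python
def sample_autocorrelation(x, max_lag):
    """Estimate R(k) for k = 0..max_lag by averaging the n - k available products x[i] * x[i + k]."""
    n = len(x)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sample length")
    return [
        sum(x[i] * x[i + k] for i in range(n - k)) / (n - k)
        for k in range(max_lag + 1)
    ]


# Toy zero-mean sequence (assumed data): an alternating signal.
estimates = sample_autocorrelation([1.0, -1.0, 1.0, -1.0], max_lag=2)
print(estimates)
```

In this toy case the lag-2 estimate rests on just two products, which is the sense in which estimates at high lags are based on very few samples.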

In the late 1960s, while working on the problem of spectral estimation for 
geophysical applications, Burg suggested an alternative method. Instead of 
setting the autocorrelations at high lags to zero, he set them to values that
make the fewest assumptions about the data (i.e., values that maximize the 
entropy rate of the process). This is consistent with the maximum entropy 
principle as articulated by Jaynes [294]. Burg assumed the process to be 
stationary and Gaussian and found that the process which maximizes the 
entropy subject to the correlation constraints is an autoregressive Gaussian 
process of the appropriate order. In some applications where we can assume 
an underlying autoregressive model for the data, this method has proved 
useful in determining the parameters of the model (e.g., linear predictive 
coding for speech). This method (known as the maximum entropy method 
or Burg’s method) is a popular method for estimation of spectral densities. 
We prove Burg’s theorem in Section 12.6. 

12.5 ENTROPY RATES OF A GAUSSIAN PROCESS 

In Chapter 8 we defined the differential entropy of a continuous random 
variable. We can now extend the definition of entropy rates to real-valued 
stochastic processes. 

Definition The differential entropy rate of a stochastic process $\{X_i\}$, $X_i \in
\mathbb{R}$, is defined to be

$$h(\mathcal{X}) = \lim_{n \to \infty} \frac{h(X_1, X_2, \ldots, X_n)}{n} \qquad (12.36)$$

if the limit exists.

Just as in the discrete case, we can show that the limit exists for stationary
processes and that the limit is given by the two expressions

$$h(\mathcal{X}) = \lim_{n \to \infty} \frac{h(X_1, X_2, \ldots, X_n)}{n} \qquad (12.37)$$
$$= \lim_{n \to \infty} h(X_n \mid X_{n-1}, \ldots, X_1). \qquad (12.38)$$

For a stationary Gaussian stochastic process, we have

$$h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log (2\pi e)^n |K^{(n)}|, \qquad (12.39)$$

where the covariance matrix $K^{(n)}$ is Toeplitz with entries $R(0), R(1), \ldots,
R(n-1)$ along the top row. Thus, $K_{ij}^{(n)} = R(i - j) = E(X_i - EX_i)(X_j
- EX_j)$. As $n \to \infty$, the density of the eigenvalues of the covariance
matrix tends to a limit, which is the spectrum of the stochastic process.
Indeed, Kolmogorov showed that the entropy rate of a stationary Gaussian
stochastic process can be expressed as

$$h(\mathcal{X}) = \frac{1}{2} \log 2\pi e + \frac{1}{4\pi} \int_{-\pi}^{\pi} \log S(\lambda)\, d\lambda. \qquad (12.40)$$

The entropy rate is also $\lim_{n \to \infty} h(X_n \mid X^{n-1})$. Since the stochastic process
is Gaussian, the conditional distribution is also Gaussian, and hence
the conditional entropy is $\frac{1}{2} \log 2\pi e \sigma_\infty^2$, where $\sigma_\infty^2$ is the variance of the
error in the best estimate of $X_n$ given the infinite past. Thus,

$$\sigma_\infty^2 = \frac{1}{2\pi e}\, 2^{2h(\mathcal{X})}, \qquad (12.41)$$

where $h(\mathcal{X})$ is given by (12.40). Hence, the entropy rate corresponds to
the minimum mean-squared error of the best estimator of a sample of the
process given the infinite past.
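Formulas (12.40) and (12.41) can be checked numerically on a process whose prediction error is known. The following sketch assumes a first-order autoregressive Gaussian process $X_i = -a X_{i-1} + Z_i$ with $Z_i \sim \mathcal{N}(0, \sigma^2)$, whose spectrum is $S(\lambda) = \sigma^2 / |1 + a e^{-i\lambda}|^2$ (the $p = 1$ case of the spectrum (12.55) in Section 12.6). The parameter values, function names, and the midpoint-rule integration are our own choices for illustration.

```python
import cmath
import math


def ar1_spectrum(lam, a, sigma2):
    """Power spectral density of X_i = -a X_{i-1} + Z_i with Var(Z_i) = sigma2 (assumes |a| < 1)."""
    return sigma2 / abs(1 + a * cmath.exp(-1j * lam)) ** 2


def entropy_rate_from_spectrum(spectrum, num_points=4096):
    """Evaluate (12.40) in bits, integrating log2 S over [-pi, pi] with the midpoint rule."""
    step = 2 * math.pi / num_points
    integral = step * sum(
        math.log2(spectrum(-math.pi + (j + 0.5) * step)) for j in range(num_points)
    )
    return 0.5 * math.log2(2 * math.pi * math.e) + integral / (4 * math.pi)


a, sigma2 = 0.6, 2.0  # assumed illustrative parameters
h_rate = entropy_rate_from_spectrum(lambda lam: ar1_spectrum(lam, a, sigma2))
h_direct = 0.5 * math.log2(2 * math.pi * math.e * sigma2)  # entropy of the innovation Z_i
sigma2_inf = 2 ** (2 * h_rate) / (2 * math.pi * math.e)    # prediction error via (12.41)
print("entropy rate from (12.40):", h_rate)
print("(1/2) log 2 pi e sigma^2 :", h_direct)
print("sigma_inf^2 from (12.41) :", sigma2_inf)
```

For this process the best predictor from the infinite past is $-a X_{n-1}$, so the prediction error variance is $\sigma^2$, and (12.41) should return the assumed value of `sigma2`.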

12.6 BURG'S MAXIMUM ENTROPY THEOREM

Theorem 12.6.1 The maximum entropy rate stochastic process $\{X_i\}$ satisfying
the constraints

$$E X_i X_{i+k} = \alpha_k, \quad k = 0, 1, \ldots, p \quad \text{for all } i, \qquad (12.42)$$

is the $p$th-order Gauss-Markov process of the form

$$X_i = -\sum_{k=1}^{p} a_k X_{i-k} + Z_i, \qquad (12.43)$$

where the $Z_i$ are i.i.d. $\sim \mathcal{N}(0, \sigma^2)$ and $a_1, a_2, \ldots, a_p$, $\sigma^2$ are chosen to
satisfy (12.42).

Remark We do not assume that $\{X_i\}$ is (a) zero mean, (b) Gaussian, or
(c) wide-sense stationary.

Proof: Let $X_1, X_2, \ldots, X_n$ be any stochastic process that satisfies the
constraints (12.42). Let $Z_1, Z_2, \ldots, Z_n$ be a Gaussian process with the
same covariance matrix as $X_1, X_2, \ldots, X_n$. Then since the multivariate
normal distribution maximizes the entropy over all vector-valued random
variables under a covariance constraint, we have

$$h(X_1, X_2, \ldots, X_n) \le h(Z_1, Z_2, \ldots, Z_n) \qquad (12.44)$$
$$= h(Z_1, \ldots, Z_p) + \sum_{i=p+1}^{n} h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_1) \qquad (12.45)$$
$$\le h(Z_1, \ldots, Z_p) + \sum_{i=p+1}^{n} h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_{i-p}) \qquad (12.46)$$

by the chain rule and the fact that conditioning reduces entropy. Now
define $Z_1', Z_2', \ldots, Z_n'$ as a $p$th-order Gauss-Markov process with the
same distribution as $Z_1, Z_2, \ldots, Z_n$ for all orders up to $p$. (Existence of
such a process will be verified using the Yule-Walker equations immediately
after the proof.) Then since $h(Z_i \mid Z_{i-1}, \ldots, Z_{i-p})$ depends only on
the $p$th-order distribution, $h(Z_i \mid Z_{i-1}, \ldots, Z_{i-p}) = h(Z_i' \mid Z_{i-1}', \ldots, Z_{i-p}')$,
and continuing the chain of inequalities, we obtain

$$h(X_1, X_2, \ldots, X_n) \le h(Z_1, \ldots, Z_p) + \sum_{i=p+1}^{n} h(Z_i \mid Z_{i-1}, Z_{i-2}, \ldots, Z_{i-p}) \qquad (12.47)$$
$$= h(Z_1', \ldots, Z_p') + \sum_{i=p+1}^{n} h(Z_i' \mid Z_{i-1}', Z_{i-2}', \ldots, Z_{i-p}') \qquad (12.48)$$
$$= h(Z_1', Z_2', \ldots, Z_n'), \qquad (12.49)$$

where the last equality follows from the $p$th-order Markovity of the $\{Z_i'\}$.

Dividing by $n$ and taking the limit, we obtain

$$\lim \frac{1}{n} h(X_1, X_2, \ldots, X_n) \le \lim \frac{1}{n} h(Z_1', Z_2', \ldots, Z_n') = h^*, \qquad (12.50)$$

where

$$h^* = \frac{1}{2} \log 2\pi e \sigma^2, \qquad (12.51)$$

which is the entropy rate of the Gauss-Markov process. Hence, the maximum
entropy rate stochastic process satisfying the constraints is the
$p$th-order Gauss-Markov process satisfying the constraints. □

A bare-bones summary of the proof is that the entropy of a finite 
segment of a stochastic process is bounded above by the entropy of a 
segment of a Gaussian random process with the same covariance structure.
This entropy is in turn bounded above by the entropy of the minimal order 
Gauss-Markov process satisfying the given covariance constraints. Such 
a process exists and has a convenient characterization by means of the 
Yule-Walker equations given below. 

Note on the choice of $a_1, \ldots, a_p$ and $\sigma^2$: Given a sequence of covariances
$R(0), R(1), \ldots, R(p)$, does there exist a pth-order Gauss-Markov process
with these covariances? Given a process of the form (12.43), can we
choose the $a_k$'s to satisfy the constraints? Multiplying (12.43) by $X_{i-l}$
and taking expectations, noting that $R(k) = R(-k)$, we get

$$R(0) = -\sum_{k=1}^{p} a_k R(-k) + \sigma^2 \tag{12.52}$$

and

$$R(l) = -\sum_{k=1}^{p} a_k R(l-k), \qquad l = 1, 2, \ldots. \tag{12.53}$$

These equations are called the Yule-Walker equations. There are $p + 1$
equations in the $p + 1$ unknowns $a_1, \ldots, a_p, \sigma^2$. Therefore, we can
solve for the parameters of the process from the covariances.
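The Yule-Walker system can be solved numerically in a few lines. The following is a minimal sketch, assuming the sign convention $X_i = -\sum_k a_k X_{i-k} + Z_i$ used in (12.52) and (12.53); the function name `levinson_durbin` and the sample covariances are illustrative choices, and the recursion shown is the standard Levinson-Durbin scheme rather than anything specific to this text.

```python
def levinson_durbin(R):
    """Solve the Yule-Walker equations for a_1..a_p and sigma^2.

    R is the list [R(0), R(1), ..., R(p)].
    Convention (as in (12.52)-(12.53)): R(l) = -sum_k a_k R(l-k) for l >= 1,
    and R(0) = -sum_k a_k R(k) + sigma^2.
    """
    p = len(R) - 1
    a = []          # coefficients a_1..a_m of the current order m
    err = R[0]      # innovation variance of the current order
    for m in range(1, p + 1):
        # Mismatch of the order-(m-1) model at lag m.
        acc = R[m] + sum(a[k - 1] * R[m - k] for k in range(1, m))
        kappa = -acc / err                      # reflection coefficient
        # Order update: a_k <- a_k + kappa * a_{m-k}, then append a_m = kappa.
        a = [a[k - 1] + kappa * a[m - k - 1] for k in range(1, m)] + [kappa]
        err *= 1.0 - kappa * kappa
    return a, err


# Hypothetical covariances generated by a_1 = -0.5, a_2 = 0.25, sigma^2 = 0.7875.
coeffs, sigma2 = levinson_durbin([1.0, 0.4, -0.05])
```

The recursion costs on the order of $p^2$ operations, compared with $p^3$ for a general linear solve, which is the advantage of exploiting the Toeplitz structure.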

Fast algorithms such as the Levinson algorithm and the Durbin algorithm
[433] have been devised to use the special structure of these
equations to calculate the coefficients $a_1, a_2, \ldots, a_p$ efficiently from the
covariances. (We set $a_0 = 1$ for a consistent notation.) Not only do the
Yule-Walker equations provide a convenient set of linear equations for
calculating the $a_k$'s and $\sigma^2$ from the $R(k)$'s; they also indicate how the
autocorrelations behave for lags greater than p. The autocorrelations for
high lags are an extension of the values for lags less than p. These values
are called the Yule-Walker extension of the autocorrelations. The
spectrum of the maximum entropy process is seen to be

\begin{align}
S(\lambda) &= \sum_{m=-\infty}^{\infty} R(m) e^{-im\lambda} \tag{12.54} \\
&= \frac{\sigma^2}{\left| 1 + \sum_{k=1}^{p} a_k e^{-ik\lambda} \right|^2}, \qquad -\pi \le \lambda \le \pi. \tag{12.55}
\end{align}

This is the maximum entropy spectral density subject to the constraints
$R(0), R(1), \ldots, R(p)$.

However, for the pth-order Gauss-Markov process, it is possible to
calculate the entropy rate directly without calculating the $a_k$'s. Let $K_p$ be
the autocorrelation matrix corresponding to this process, that is, the matrix with
$R_0, R_1, \ldots, R_p$ along the top row. For this process, the entropy rate is
equal to

\begin{align}
h^* = h(X_p \mid X_{p-1}, \ldots, X_0) &= h(X_0, \ldots, X_p) - h(X_0, \ldots, X_{p-1}) \tag{12.56} \\
&= \frac{1}{2} \log (2\pi e)^{p+1} |K_p| - \frac{1}{2} \log (2\pi e)^{p} |K_{p-1}| \tag{12.57} \\
&= \frac{1}{2} \log (2\pi e) \frac{|K_p|}{|K_{p-1}|}. \tag{12.58}
\end{align}
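As a quick consistency check of (12.58), the case $p = 1$ can be worked by hand. The special case below is an illustration added here, not part of the original derivation.

```latex
K_1 = \begin{pmatrix} R_0 & R_1 \\ R_1 & R_0 \end{pmatrix}, \qquad K_0 = (R_0),
\qquad
h^* = \frac{1}{2} \log (2\pi e) \frac{|K_1|}{|K_0|}
    = \frac{1}{2} \log \left( 2\pi e \, \frac{R_0^2 - R_1^2}{R_0} \right).

% Yule-Walker with p = 1:  R_1 = -a_1 R_0  and  R_0 = -a_1 R_1 + \sigma^2, so
a_1 = -\frac{R_1}{R_0}, \qquad
\sigma^2 = R_0 - \frac{R_1^2}{R_0} = \frac{R_0^2 - R_1^2}{R_0}.
```

The determinant ratio therefore equals $\sigma^2$, so (12.58) agrees with the entropy rate $\frac{1}{2} \log 2\pi e \sigma^2$ of (12.51) in this case.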

In a practical problem, we are generally given a sample sequence
$X_1, X_2, \ldots, X_n$, from which we calculate the autocorrelations. An important
question is: How many autocorrelation lags should we consider (i.e.,
what is the optimum value of p)? A logically sound method is to choose
the value of p that minimizes the total description length in a two-stage
description of the data. This method has been proposed by Rissanen
[442, 447] and Barron [33] and is closely related to the idea of Kolmogorov
complexity.


SUMMARY 


Maximum entropy distribution. Let $f$ be a probability density satisfying
the constraints

$$\int_S f(x) r_i(x) \, dx = \alpha_i \quad \text{for } 1 \le i \le m. \tag{12.59}$$

Let $f^*(x) = f_\lambda(x) = e^{\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$, $x \in S$, and let $\lambda_0, \ldots, \lambda_m$ be chosen
so that $f^*$ satisfies (12.59). Then $f^*$ uniquely maximizes $h(f)$ over
all $f$ satisfying these constraints.


Maximum entropy spectral density estimation. The entropy rate of a
stochastic process subject to autocorrelation constraints $R_0, R_1, \ldots, R_p$
is maximized by the pth order zero-mean Gauss-Markov process satisfying
these constraints. The maximum entropy rate is

$$h^* = \frac{1}{2} \log (2\pi e) \frac{|K_p|}{|K_{p-1}|}, \tag{12.60}$$

and the maximum entropy spectral density is

$$S(\lambda) = \frac{\sigma^2}{\left| 1 + \sum_{k=1}^{p} a_k e^{-ik\lambda} \right|^2}. \tag{12.61}$$


PROBLEMS 

12.1 Maximum entropy. Find the maximum entropy density $f$,
defined for $x \ge 0$, satisfying $EX = \alpha_1$, $E \ln X = \alpha_2$. That is,
maximize $-\int f \ln f$ subject to $\int x f(x) \, dx = \alpha_1$, $\int (\ln x) f(x) \, dx = \alpha_2$,
where the integral is over $0 \le x < \infty$. What family of densities
is this?

12.2 Min $D(P \| Q)$ under constraints on P. We wish to find the
(parametric form) of the probability mass function $P(x)$, $x \in \{1, 2, \ldots\}$
that minimizes the relative entropy $D(P \| Q)$ over all P such
that $\sum P(x) g_i(x) = \alpha_i$, $i = 1, 2, \ldots$.

(a) Use Lagrange multipliers to guess that

$$P^*(x) = Q(x) e^{\sum_{i=1}^{\infty} \lambda_i g_i(x) + \lambda_0} \tag{12.62}$$

achieves this minimum if there exist $\lambda_i$'s satisfying the $\alpha_i$
constraints. This generalizes the theorem on maximum entropy
distributions subject to constraints.

(b) Verify that $P^*$ minimizes $D(P \| Q)$.

12.3 Maximum entropy processes. Find the maximum entropy rate
stochastic process $\{X_i\}_{-\infty}^{\infty}$ subject to the constraints:

(a) $EX_i^2 = 1$, $i = 1, 2, \ldots$.

(b) $EX_i^2 = 1$, $EX_i X_{i+1} = \frac{1}{2}$, $i = 1, 2, \ldots$.

(c) Find the maximum entropy spectrum for the processes in parts
(a) and (b).

12.4 Maximum entropy with marginals. What is the maximum entropy
distribution $p(x, y)$ that has the following marginals?

    x        1          2          3
    1     $p_{11}$   $p_{12}$   $p_{13}$
    2     $p_{21}$   $p_{22}$   $p_{23}$
    3     $p_{31}$   $p_{32}$   $p_{33}$

(Hint: You may wish to guess and verify a more general result.)

12.5 Processes with fixed marginals. Consider the set of all densities
with fixed pairwise marginals $f_{X_1,X_2}(x_1, x_2)$, $f_{X_2,X_3}(x_2, x_3)$, $\ldots$,
$f_{X_{n-1},X_n}(x_{n-1}, x_n)$. Show that the maximum entropy process with
these marginals is the first-order (possibly time-varying) Markov
process with these marginals. Identify the maximizing $f^*(x_1, x_2, \ldots, x_n)$.

12.6 Every density is a maximum entropy density. Let $f_0(x)$ be a
given density. Given $r(x)$, let $g_\alpha(x)$ be the density maximizing
$h(X)$ over all $f$ satisfying $\int f(x) r(x) \, dx = \alpha$. Now let $r(x) = \ln f_0(x)$.
Show that $g_\alpha(x) = f_0(x)$ for an appropriate choice $\alpha = \alpha_0$.
Thus, $f_0(x)$ is a maximum entropy density under the constraint
$\int f \ln f_0 = \alpha_0$.

12.7 Mean-squared error. Let $\{X_i\}_{i=1}^{n}$ satisfy $EX_i X_{i+k} = R_k$,
$k = 0, 1, \ldots, p$. Consider linear predictors for $X_n$; that is,

$$\hat{X}_n = \sum_{i=1}^{n-1} b_i X_{n-i}.$$

Assume that $n > p$. Find

$$\max_{f(x^n)} \min_{b} E(X_n - \hat{X}_n)^2,$$

where the minimum is over all linear predictors b and the maximum
is over all densities $f$ satisfying $R_0, \ldots, R_p$.

12.8 Maximum entropy characteristic functions. We ask for the maximum
entropy density $f(x)$, $0 \le x \le a$, satisfying a constraint on
the characteristic function $\psi(u) = \int_0^a e^{iux} f(x) \, dx$. The answers
need be given only in parametric form.

(a) Find the maximum entropy $f$ satisfying $\int_0^a f(x) \cos(u_0 x) \, dx = \alpha$,
at a specified point $u_0$.

(b) Find the maximum entropy $f$ satisfying $\int_0^a f(x) \sin(u_0 x) \, dx = \beta$.

(c) Find the maximum entropy density $f(x)$, $0 \le x \le a$, having a
given value of the characteristic function $\psi(u_0)$ at a specified
point $u_0$.

(d) What problem is encountered if $a = \infty$?

12.9 Maximum entropy processes

(a) Find the maximum entropy rate binary stochastic process
$\{X_i\}_{i=-\infty}^{\infty}$, $X_i \in \{0, 1\}$, satisfying $\Pr\{X_i = X_{i+1}\} = \frac{1}{3}$ for
all i.

(b) What is the resulting entropy rate?

12.10 Maximum entropy of sums. Let $Y = X_1 + X_2$. Find the maximum
entropy density for Y under the constraint $EX_1^2 = P_1$, $EX_2^2 = P_2$:

(a) If $X_1$ and $X_2$ are independent.

(b) If $X_1$ and $X_2$ are allowed to be dependent.

(c) Prove part (a).

12.11 Maximum entropy Markov chain. Let $\{X_i\}$ be a stationary
Markov chain with $X_i \in \{1, 2, 3\}$. Let $I(X_n; X_{n+2}) = 0$ for all n.

(a) What is the maximum entropy rate process satisfying this
constraint?

(b) What if $I(X_n; X_{n+2}) = \alpha$ for all n for some given value of
$\alpha$, $0 \le \alpha \le \log 3$?

12.12 Entropy bound on prediction error. Let $\{X_n\}$ be an arbitrary
real-valued stochastic process. Let $\hat{X}_{n+1} = E\{X_{n+1} \mid X^n\}$. Thus the
conditional mean $\hat{X}_{n+1}$ is a random variable depending on the n-past
$X^n$. Here $\hat{X}_{n+1}$ is the minimum mean squared error prediction of
$X_{n+1}$ given the past.

(a) Find a lower bound on the conditional variance
$E\{E\{(X_{n+1} - \hat{X}_{n+1})^2 \mid X^n\}\}$ in terms of the conditional differential entropy
$h(X_{n+1} \mid X^n)$.

(b) Is equality achieved when $\{X_n\}$ is a Gaussian stochastic process?

12.13 Maximum entropy rate. What is the maximum entropy rate stochastic
process $\{X_i\}$ over the symbol set $\{0, 1\}$ for which the
probability that 00 occurs in a sequence is zero?

12.14 Maximum entropy

(a) What is the parametric-form maximum entropy density $f(x)$
satisfying the two conditions

$$EX^8 = a, \qquad EX^{16} = b?$$

(b) What is the maximum entropy density satisfying the condition

$$E(X^8 + X^{16}) = a + b?$$

(c) Which entropy is higher?

12.15 Maximum entropy. Find the parametric form of the maximum
entropy density $f$ satisfying the Laplace transform condition

$$\int f(x) e^{-x} \, dx = \alpha,$$

and give the constraints on the parameter.

12.16 Maximum entropy processes. Consider the set of all stochastic
processes with $\{X_i\}$, $X_i \in \mathcal{R}$, with

$$R_0 = EX_i^2 = 1, \qquad R_1 = EX_i X_{i+1} = \frac{1}{2}.$$

Find the maximum entropy rate.

12.17 Binary maximum entropy. Consider a binary process $\{X_i\}$, $X_i \in \{-1, +1\}$,
with $R_0 = EX_i^2 = 1$ and $R_1 = EX_i X_{i+1} = \frac{1}{2}$.

(a) Find the maximum entropy process with these constraints.

(b) What is the entropy rate?

(c) Is there a Bernoulli process satisfying these constraints?

12.18 Maximum entropy. Maximize $h(Z, V_x, V_y, V_z)$ subject to the energy
constraint $E\left(\frac{1}{2} m \|V\|^2 + mgZ\right) = E_0$. Show that the resulting
distribution yields

$$E \frac{1}{2} m \|V\|^2 = \frac{3}{5} E_0, \qquad E \, mgZ = \frac{2}{5} E_0.$$

Thus, $\frac{2}{5}$ of the energy is stored in the potential field, regardless of
its strength g.

12.19 Maximum entropy discrete processes

(a) Find the maximum entropy rate binary stochastic process
$\{X_i\}_{i=-\infty}^{\infty}$, $X_i \in \{0, 1\}$, satisfying $\Pr\{X_i = X_{i+1}\} = \frac{1}{3}$ for all
i.

(b) What is the resulting entropy rate?

12.20 Maximum entropy of sums. Let $Y = X_1 + X_2$. Find the maximum
entropy of Y under the constraint $EX_1^2 = P_1$, $EX_2^2 = P_2$:

(a) If $X_1$ and $X_2$ are independent.

(b) If $X_1$ and $X_2$ are allowed to be dependent.

12.21 Entropy rate

(a) Find the maximum entropy rate stochastic process $\{X_i\}$ with
$EX_i^2 = 1$, $EX_i X_{i+2} = \alpha$, $i = 1, 2, \ldots$. Be careful.

(b) What is the maximum entropy rate?

(c) What is $EX_i X_{i+1}$ for this process?

12.22 Minimum expected value

(a) Find the minimum value of $EX$ over all probability density
functions $f(x)$ satisfying the following three constraints:

(i) $f(x) = 0$ for $x \le 0$.

(ii) $\int_{-\infty}^{\infty} f(x) \, dx = 1$.

(iii) $h(f) = h$.

(b) Solve the same problem if (i) is replaced by

(i') $f(x) = 0$ for $x \le a$.


HISTORICAL NOTES 

The maximum entropy principle arose in statistical mechanics in the
nineteenth century and has been advocated for use in a broader context
by Jaynes [294]. It was applied to spectral estimation by Burg [80].
The information-theoretic proof of Burg’s theorem is from Choi and
Cover [98].



CHAPTER 13 


UNIVERSAL SOURCE CODING 


Here we develop the basics of universal source coding. Minimax regret 
data compression is defined, and the descriptive cost of universality is 
shown to be the information radius of the relative entropy ball containing 
all the source distributions. The minimax theorem shows this radius to 
be the channel capacity for the associated channel given by the source 
distribution. Arithmetic coding enables the use of a source distribution 
that is learned on the fly. Finally, individual sequence compression is 
defined and achieved by a succession of Lempel-Ziv parsing algorithms. 

In Chapter 5 we introduced the problem of finding the shortest
representation of a source, and showed that the entropy is the fundamental
lower limit on the expected length of any uniquely decodable
representation. We also showed that if we know the probability distribution for
the source, we can use the Huffman algorithm to construct the optimal
(minimal expected length) code for that distribution.

For many practical situations, however, the probability distribution
underlying the source may be unknown, and we cannot apply the methods
of Chapter 5 directly. Instead, all we know is a class of distributions. One
possible approach is to wait until we have seen all the data, estimate the
distribution from the data, use this distribution to construct the best code,
and then go back to the beginning and compress the data using this code.
This two-pass procedure is used in some applications where there is a
fairly small amount of data to be compressed. But there are many situations
in which it is not feasible to make two passes over the data, and it
is desirable to have a one-pass (or online) algorithm to compress the data
that “learns” the probability distribution of the data and uses it to compress
the incoming symbols. We show the existence of such algorithms
that do well for any distribution within a class of distributions.

In yet other cases, there is no probability distribution underlying the
data; all we are given is an individual sequence of outcomes. Examples
of such data sources include text and music. We can then ask the question:
How well can we compress the sequence? If we do not put any restrictions
on the class of algorithms, we get a meaningless answer: there
always exists a function that compresses a particular sequence to one
bit while leaving every other sequence uncompressed. This function is
clearly “overfitted” to the data. However, if we compare our performance
to that achievable by optimal word assignments with respect to Bernoulli
distributions or kth-order Markov processes, we obtain more interesting
answers that are in many ways analogous to the results for the probabilistic
or average case analysis. The ultimate answer for compressibility for
an individual sequence is the Kolmogorov complexity of the sequence,
which we discuss in Chapter 14.

We begin the chapter by considering the problem of source coding as 
a game in which the coder chooses a code that attempts to minimize 
the average length of the representation and nature chooses a distribution 
on the source sequence. We show that this game has a value that is 
related to the capacity of a channel with rows of its transition matrix that 
are the possible distributions on the source sequence. We then consider 
algorithms for encoding the source sequence given a known or “estimated” 
distribution on the sequence. In particular, we describe arithmetic coding, 
which is an extension of the Shannon-Fano-Elias code of Section 5.9 
that permits incremental encoding and decoding of sequences of source 
symbols. 

We then describe two basic versions of the class of adaptive dictionary 
compression algorithms called Lempel-Ziv, based on the papers by Ziv 
and Lempel [603, 604]. We provide a proof of asymptotic optimality for 
these algorithms, showing that in the limit they achieve the entropy rate 
for any stationary ergodic source. In Chapter 16 we extend the notion of 
universality to investment in the stock market and describe online portfolio 
selection procedures that are analogous to the universal methods for data 
compression. 

13.1 UNIVERSAL CODES AND CHANNEL CAPACITY 


Assume that we have a random variable X drawn according to a distribution
from the family $\{p_\theta\}$, where the parameter $\theta \in \{1, 2, \ldots, m\}$ is
unknown. We wish to find an efficient code for this source.

From the results of Chapter 5, if we know $\theta$, we can construct a code
with codeword lengths $l(x) = \log \frac{1}{p_\theta(x)}$, achieving an average codeword
length equal to the entropy $H_\theta(x) = -\sum_x p_\theta(x) \log p_\theta(x)$, and this is the
best that we can do. For the purposes of this section, we will ignore the
integer constraints on $l(x)$, knowing that applying the integer constraint
will cost at most one bit in expected length. Thus,

$$\min_{l(x)} E_{p_\theta}[l(X)] = E_{p_\theta}\left[ \log \frac{1}{p_\theta(X)} \right] = H(p_\theta). \tag{13.1}$$


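What happens if we do not know the true distribution $p_\theta$, yet wish to
code as efficiently as possible? In this case, using a code with codeword
lengths $l(x)$ and implied probability $q(x) = 2^{-l(x)}$, we define the redundancy
of the code as the difference between the expected length of the
code and the lower limit for the expected length:

\begin{align}
R(p_\theta, q) &= E_{p_\theta}[l(X)] - E_{p_\theta}\left[ \log \frac{1}{p_\theta(X)} \right] \tag{13.2} \\
&= \sum_x p_\theta(x) \left( l(x) - \log \frac{1}{p_\theta(x)} \right) \tag{13.3} \\
&= \sum_x p_\theta(x) \left( \log \frac{1}{q(x)} - \log \frac{1}{p_\theta(x)} \right) \tag{13.4} \\
&= \sum_x p_\theta(x) \log \frac{p_\theta(x)}{q(x)} \tag{13.5} \\
&= D(p_\theta \| q), \tag{13.6}
\end{align}

where $q(x) = 2^{-l(x)}$ is the distribution that corresponds to the codeword
lengths $l(x)$.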
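The identity between redundancy and relative entropy is easy to check numerically. The sketch below is illustrative only: the two distributions are made-up values, the function names are our own, and integer constraints on the codeword lengths are ignored as in the text.

```python
from math import log2


def expected_length(p, q):
    """Expected ideal codeword length sum_x p(x) log2(1/q(x)) in bits."""
    return sum(px * log2(1.0 / qx) for px, qx in zip(p, q) if px > 0)


def entropy(p):
    """Entropy H(p) in bits."""
    return sum(px * log2(1.0 / px) for px in p if px > 0)


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)


p_theta = [0.5, 0.25, 0.25]   # hypothetical true source distribution
q = [0.25, 0.25, 0.5]         # hypothetical distribution used to design the code

# Redundancy: expected length under the mismatched code minus the entropy.
redundancy = expected_length(p_theta, q) - entropy(p_theta)
```

For these values the redundancy is a quarter of a bit per symbol, which matches `relative_entropy(p_theta, q)`.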

We wish to find a code that does well irrespective of the true distribution
$p_\theta$, and thus we define the minimax redundancy as

$$R^* = \min_q \max_{p_\theta} R(p_\theta, q) = \min_q \max_{p_\theta} D(p_\theta \| q). \tag{13.7}$$

This minimax redundancy is achieved by a distribution q that is at the
“center” of the information ball containing the distributions $p_\theta$, that is,
the distribution q whose maximum distance from any of the distributions
$p_\theta$ is minimized (Figure 13.1).

To find the distribution q that is as close as possible to all the possible
$p_\theta$ in relative entropy, consider the following channel:

$$\theta \rightarrow \begin{bmatrix} \cdots \; p_1 \; \cdots \\ \cdots \; p_2 \; \cdots \\ \vdots \\ \cdots \; p_\theta \; \cdots \\ \vdots \\ \cdots \; p_m \; \cdots \end{bmatrix} \rightarrow X. \tag{13.8}$$

FIGURE 13.1. Minimum radius information ball containing all the $p_\theta$’s


This is a channel $\{\theta, p_\theta(x), \mathcal{X}\}$ with the rows of the transition matrix equal
to the different $p_\theta$’s, the possible distributions of the source. We will show
that the minimax redundancy $R^*$ is equal to the capacity of this channel,
and the corresponding optimal coding distribution is the output distribution
of this channel induced by the capacity-achieving input distribution. The
capacity of this channel is given by

$$C = \max_{\pi(\theta)} I(\theta; X) = \max_{\pi(\theta)} \sum_{\theta} \sum_{x} \pi(\theta) p_\theta(x) \log \frac{p_\theta(x)}{q_\pi(x)}, \tag{13.9}$$

where

$$q_\pi(x) = \sum_{\theta} \pi(\theta) p_\theta(x). \tag{13.10}$$

The equivalence of $R^*$ and C is expressed in the following theorem:

Theorem 13.1.1 (Gallager [229], Ryabko [450]) The capacity of a
channel $p(x \mid \theta)$ with rows $p_1, p_2, \ldots, p_m$ is given by

$$C = R^* = \min_q \max_\theta D(p_\theta \| q). \tag{13.11}$$

The distribution q that achieves the minimum in (13.11) is the output
distribution $q^*(x)$ induced by the capacity-achieving input distribution
$\pi^*(\theta)$:

$$q^*(x) = q_{\pi^*}(x) = \sum_{\theta} \pi^*(\theta) p_\theta(x). \tag{13.12}$$

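Proof: Let $\pi(\theta)$ be an input distribution on $\theta \in \{1, 2, \ldots, m\}$, and let
the induced output distribution be $q_\pi$:

$$(q_\pi)_j = \sum_{i=1}^{m} \pi_i p_{ij}, \tag{13.13}$$

where $p_{ij} = p_\theta(x)$ for $\theta = i$, $x = j$. Then for any distribution q on the
output, we have

\begin{align}
I_\pi(\theta; X) &= \sum_{i,j} \pi_i p_{ij} \log \frac{p_{ij}}{(q_\pi)_j} \tag{13.14} \\
&= \sum_i \pi_i D(p_i \| q_\pi) \tag{13.15} \\
&= \sum_{i,j} \pi_i p_{ij} \log \left( \frac{p_{ij}}{q_j} \cdot \frac{q_j}{(q_\pi)_j} \right) \tag{13.16} \\
&= \sum_{i,j} \pi_i p_{ij} \log \frac{p_{ij}}{q_j} + \sum_{i,j} \pi_i p_{ij} \log \frac{q_j}{(q_\pi)_j} \tag{13.17} \\
&= \sum_{i,j} \pi_i p_{ij} \log \frac{p_{ij}}{q_j} + \sum_{j} (q_\pi)_j \log \frac{q_j}{(q_\pi)_j} \tag{13.18} \\
&= \sum_{i,j} \pi_i p_{ij} \log \frac{p_{ij}}{q_j} - D(q_\pi \| q) \tag{13.19} \\
&= \sum_i \pi_i D(p_i \| q) - D(q_\pi \| q) \tag{13.20} \\
&\le \sum_i \pi_i D(p_i \| q) \tag{13.21}
\end{align}

for all q, with equality iff $q = q_\pi$. Thus, for all q,

$$\sum_i \pi_i D(p_i \| q) \ge \sum_i \pi_i D(p_i \| q_\pi), \tag{13.22}$$

and therefore

$$I_\pi(\theta; X) = \min_q \sum_i \pi_i D(p_i \| q) \tag{13.23}$$

is achieved when $q = q_\pi$. Thus, the output distribution that minimizes the
average distance to all the rows of the transition matrix is the output
distribution induced by the channel (Lemma 10.8.1).

The channel capacity can now be written as

\begin{align}
C &= \max_\pi I_\pi(\theta; X) \tag{13.24} \\
&= \max_\pi \min_q \sum_i \pi_i D(p_i \| q). \tag{13.25}
\end{align}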

We can now apply a fundamental theorem of game theory, which states
that for a continuous function $f(x, y)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, if $f(x, y)$ is convex
in x and concave in y, and $\mathcal{X}$, $\mathcal{Y}$ are compact convex sets, then

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y). \tag{13.26}$$

The proof of this minimax theorem can be found in [305, 392].

By convexity of relative entropy (Theorem 2.7.2), $\sum_i \pi_i D(p_i \| q)$ is
convex in q and concave in $\pi$, and therefore

\begin{align}
C &= \max_\pi \min_q \sum_i \pi_i D(p_i \| q) \tag{13.27} \\
&= \min_q \max_\pi \sum_i \pi_i D(p_i \| q) \tag{13.28} \\
&= \min_q \max_i D(p_i \| q), \tag{13.29}
\end{align}

where the last equality follows from the fact that the maximum is achieved
by putting all the weight on the index i maximizing $D(p_i \| q)$ in (13.28).
It also follows that $q^* = q_{\pi^*}$. This completes the proof. □

Thus, the channel capacity of the channel from $\theta$ to X is the minimax
expected redundancy in source coding.

Example 13.1.1 Consider the case when $\mathcal{X} = \{1, 2, 3\}$ and $\theta$ takes only
two values, 1 and 2, and the corresponding distributions are $p_1 = (1 - \alpha, \alpha, 0)$
and $p_2 = (0, \alpha, 1 - \alpha)$. We would like to encode a sequence of
symbols from $\mathcal{X}$ without knowing whether the distribution is $p_1$ or $p_2$.
The arguments above indicate that the worst-case optimal code uses the
codeword lengths corresponding to the distribution that has a minimal
relative entropy distance from both distributions, in this case, the midpoint
of the two distributions. Using this distribution, $q = \left( \frac{1-\alpha}{2}, \alpha, \frac{1-\alpha}{2} \right)$, we
achieve a redundancy of

$$D(p_1 \| q) = D(p_2 \| q) = (1 - \alpha) \log \frac{1 - \alpha}{(1 - \alpha)/2} + \alpha \log \frac{\alpha}{\alpha} + 0 = 1 - \alpha. \tag{13.30}$$

The channel with transition matrix rows equal to $p_1$ and $p_2$ is equivalent
to the erasure channel (Section 7.1.5), and the capacity of this channel can
easily be calculated to be $(1 - \alpha)$, achieved with a uniform distribution on
the inputs. The output distribution corresponding to the capacity-achieving
input distribution is equal to $\left( \frac{1-\alpha}{2}, \alpha, \frac{1-\alpha}{2} \right)$ (i.e., the same as the
distribution q above). Thus, if we don’t know the distribution for this class of
sources, we code using the distribution q rather than $p_1$ or $p_2$, and incur
an additional cost of $1 - \alpha$ bits per source symbol above the ideal entropy
bound.
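The numbers in this example can be reproduced with a short computation. This is a sketch under stated assumptions: the value $\alpha = 0.25$ and the helper names are illustrative, and the mutual information is simply evaluated at the uniform input distribution that the text identifies as capacity-achieving, rather than found by an optimization.

```python
from math import log2


def kl(p, q):
    """Relative entropy D(p || q) in bits; terms with p_j = 0 contribute 0."""
    return sum(pj * log2(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def mutual_information(prior, rows):
    """Return (I(theta; X), q_pi) for input weights `prior` and channel rows p_theta."""
    n_out = len(rows[0])
    q_pi = [sum(w * row[j] for w, row in zip(prior, rows)) for j in range(n_out)]
    # I(theta; X) = sum_i pi_i D(p_i || q_pi), as in (13.15).
    info = sum(w * kl(row, q_pi) for w, row in zip(prior, rows))
    return info, q_pi


alpha = 0.25                          # illustrative erasure probability
p1 = (1 - alpha, alpha, 0.0)
p2 = (0.0, alpha, 1 - alpha)

capacity, q = mutual_information([0.5, 0.5], [p1, p2])
worst_case_redundancy = max(kl(p1, q), kl(p2, q))
```

Both `capacity` and `worst_case_redundancy` come out equal to $1 - \alpha$, and `q` is the midpoint distribution of the example.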


13.2 UNIVERSAL CODING FOR BINARY SEQUENCES 


Now we consider an important special case of encoding a binary sequence
$x^n \in \{0, 1\}^n$. We do not make any assumptions about the probability
distribution for $x_1, x_2, \ldots, x_n$.

We begin with bounds on the size of $\binom{n}{k}$, taken from Wozencraft and
Reiffen [567] and proved in Lemma 17.5.1: For $k \ne 0$ or $n$,

$$\sqrt{\frac{n}{8k(n-k)}} \le \binom{n}{k} 2^{-nH(k/n)} \le \sqrt{\frac{n}{\pi k(n-k)}}. \tag{13.31}$$
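The bounds in (13.31) can be checked numerically for small n. The following sketch is illustrative; the function names are our own, and floating-point evaluation limits it to moderate values of n.

```python
from math import comb, log2, pi, sqrt


def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def binomial_bounds(n, k):
    """Return (lower, middle, upper) of the inequality chain (13.31), 0 < k < n."""
    lower = sqrt(n / (8 * k * (n - k)))
    middle = comb(n, k) * 2.0 ** (-n * binary_entropy(k / n))
    upper = sqrt(n / (pi * k * (n - k)))
    return lower, middle, upper


lower, middle, upper = binomial_bounds(20, 5)
```

The lower bound is met with equality at $n = 2$, $k = 1$, so a small tolerance is advisable when comparing floating-point values.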


We first describe an offline algorithm to describe the sequence; we
count the number of 1’s in the sequence, and after we have seen the entire
sequence, we send a two-stage description of the sequence. The first stage
is a count of the number of 1’s in the sequence [i.e., $k = \sum_i x_i$ (using
$\lceil \log(n + 1) \rceil$ bits)], and the second stage is the index of this sequence
among all sequences that have k 1’s (using $\lceil \log \binom{n}{k} \rceil$ bits). This two-stage
description requires total length

\begin{align}
l(x^n) &\le \log(n + 1) + \log \binom{n}{k} + 2 \tag{13.32} \\
&\le \log n + nH\left( \frac{k}{n} \right) - \frac{1}{2} \log n - \frac{1}{2} \log \left( \pi \frac{k}{n} \frac{(n-k)}{n} \right) + 3 \tag{13.33} \\
&= nH\left( \frac{k}{n} \right) + \frac{1}{2} \log n - \frac{1}{2} \log \left( \pi \frac{k}{n} \frac{(n-k)}{n} \right) + 3. \tag{13.34}
\end{align}

Thus, the cost of describing the sequence is approximately $\frac{1}{2} \log n$ bits
above the optimal cost with the Shannon code for a Bernoulli distribution
corresponding to $k/n$. The last term is unbounded at $k = 0$ or $k = n$, so
the bound is not useful for these cases (the actual description length is
$\log(n + 1)$ bits, whereas the entropy $H(k/n) = 0$ when $k = 0$ or $k = n$).
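The two-stage description can be made concrete with an enumerative code. The sketch below is one possible realization, not the text's own construction: it assumes lexicographic ordering of the sequences with k ones, and the function names are hypothetical.

```python
from math import comb


def two_stage_encode(bits):
    """Return (k, rank): the number of 1's and the lexicographic index of
    `bits` among all sequences of the same length with k ones."""
    n, k = len(bits), sum(bits)
    rank, ones_left = 0, k
    for pos, b in enumerate(bits):
        remaining = n - pos - 1
        if b == 1:
            # Every sequence with a 0 here (and the same prefix) comes first.
            rank += comb(remaining, ones_left)
            ones_left -= 1
    return k, rank


def two_stage_decode(n, k, rank):
    """Invert two_stage_encode."""
    bits, ones_left = [], k
    for pos in range(n):
        remaining = n - pos - 1
        zeros_first = comb(remaining, ones_left)   # sequences with a 0 here
        if ones_left > 0 and rank >= zeros_first:
            bits.append(1)
            rank -= zeros_first
            ones_left -= 1
        else:
            bits.append(0)
    return bits


def description_length(n, k):
    """ceil(log2(n+1)) + ceil(log2(C(n,k))) bits, computed exactly on integers."""
    return n.bit_length() + (comb(n, k) - 1).bit_length()
```

For example, `two_stage_encode([1, 1, 0, 1, 0, 1])` returns the pair that `two_stage_decode(6, 4, rank)` maps back to the same sequence. Here `n.bit_length()` equals $\lceil \log(n+1) \rceil$ and `(m - 1).bit_length()` equals $\lceil \log m \rceil$ for positive integers.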

This counting approach requires the compressor to wait until he has
seen the entire sequence. We now describe a different approach using a
mixture distribution that achieves the same result on the fly. We choose
the coding distribution $q(x_1, x_2, \ldots, x_n) = 2^{-l(x_1, x_2, \ldots, x_n)}$ to be a uniform
mixture of all Bernoulli($\theta$) distributions on $x_1, x_2, \ldots, x_n$. We will analyze
the performance of a code using this distribution and show that such codes
perform well for all input sequences.

We construct this distribution by assuming that $\theta$, the parameter of
the Bernoulli distribution, is drawn according to a uniform distribution on
$[0, 1]$. The probability of a sequence $x_1, x_2, \ldots, x_n$ with k ones is
$\theta^k (1 - \theta)^{n-k}$ under the Bernoulli($\theta$) distribution. Thus, the mixture probability
of the sequence is

$$p(x_1, x_2, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k} \, d\theta = A(n, k). \tag{13.35}$$

Integrating by parts, setting $u = (1 - \theta)^{n-k}$ and $dv = \theta^k \, d\theta$, we have

$$\int_0^1 \theta^k (1 - \theta)^{n-k} \, d\theta = \left[ \frac{1}{k+1} \theta^{k+1} (1 - \theta)^{n-k} \right]_0^1 + \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1} \, d\theta, \tag{13.36}$$

or

$$A(n, k) = \frac{n-k}{k+1} A(n, k+1). \tag{13.37}$$

Now $A(n, n) = \int_0^1 \theta^n \, d\theta = \frac{1}{n+1}$, and we can easily verify from the
recursion that

$$p(x_1, x_2, \ldots, x_n) = A(n, k) = \frac{1}{n+1} \frac{1}{\binom{n}{k}}. \tag{13.38}$$

The codeword length with respect to the mixture distribution is

$$\left\lceil \log \frac{1}{q(x^n)} \right\rceil \le \log(n + 1) + \log \binom{n}{k} + 1, \tag{13.39}$$

which is within one bit of the length of the two-stage description above.
Thus, we have a similar bound on the codeword length

$$l(x_1, x_2, \ldots, x_n) \le nH\left( \frac{k}{n} \right) + \frac{1}{2} \log n - \frac{1}{2} \log \left( \pi \frac{k}{n} \frac{(n-k)}{n} \right) + 2 \tag{13.40}$$

for all sequences $x_1, x_2, \ldots, x_n$. This mixture distribution achieves a codeword
length within $\frac{1}{2} \log n$ bits of the optimal code length $nH(k/n)$ that
would be required if the source were really Bernoulli($k/n$), without any
assumptions about the distribution of the source.
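The closed form (13.38) can be verified from the recursion (13.37) with exact rational arithmetic. This is a small illustrative check; the function name `mixture_probability` is our own.

```python
from fractions import Fraction
from math import comb


def mixture_probability(n, k):
    """A(n, k) obtained by running recursion (13.37) down from A(n, n) = 1/(n+1)."""
    a = Fraction(1, n + 1)                    # A(n, n)
    for j in range(n - 1, k - 1, -1):
        a = Fraction(n - j, j + 1) * a        # A(n, j) = (n-j)/(j+1) * A(n, j+1)
    return a


def closed_form(n, k):
    """Right-hand side of (13.38)."""
    return Fraction(1, (n + 1) * comb(n, k))
```

Because every sequence with the same number of ones gets the same probability, summing `closed_form(n, k) * comb(n, k)` over k gives 1, confirming that the mixture is a valid distribution on $\{0, 1\}^n$.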

This mixture distribution yields a nice expression for the conditional
probability of the next symbol given the previous symbols of $x_1, x_2, \ldots, x_n$. Let
$k_i$ be the number of 1’s in the first i symbols of $x_1, x_2, \ldots, x_n$. Using (13.38),
we have

\begin{align}
q(x_{i+1} = 1 \mid x^i) &= \frac{q(x^i, 1)}{q(x^i)} \tag{13.41} \\
&= \frac{\frac{1}{i+2} \Big/ \binom{i+1}{k_i+1}}{\frac{1}{i+1} \Big/ \binom{i}{k_i}} \tag{13.42} \\
&= \frac{1}{i+2} \, \frac{(k_i + 1)! \, (i - k_i)!}{(i+1)!} \, (i+1) \, \frac{i!}{k_i! \, (i - k_i)!} \tag{13.43} \\
&= \frac{k_i + 1}{i + 2}. \tag{13.44}
\end{align}

This is the Bayesian posterior probability of 1 given the uniform prior
on $\theta$, and is called the Laplace estimate for the probability of the next
symbol. We can use this posterior probability as the probability of the next
symbol for arithmetic coding, and achieve the codeword length $\log \frac{1}{q(x^n)}$
in a sequential manner with finite-precision arithmetic. This is a horizon-free
result, in that the procedure does not depend on the length of the
sequence.
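The sequential use of the Laplace estimate can be sketched as follows. The function name and the sample sequence are illustrative; exact fractions are used so that the product of conditional probabilities can be compared with (13.38) without rounding.

```python
from fractions import Fraction
from math import comb


def laplace_sequence_probability(bits):
    """Multiply the conditional probabilities (13.44) along the sequence."""
    q, ones = Fraction(1), 0
    for i, b in enumerate(bits):              # i symbols seen so far, `ones` of them 1
        p_one = Fraction(ones + 1, i + 2)     # Laplace estimate of the next symbol being 1
        q *= p_one if b == 1 else 1 - p_one
        ones += b
    return q


x = [1, 1, 0, 1, 0, 1]                        # n = 6, k = 4
q_x = laplace_sequence_probability(x)
```

The result depends only on n and k, not on the order of the symbols, and equals $\frac{1}{(n+1)\binom{n}{k}}$; the estimator never needs to know n in advance.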

One issue with the uniform mixture approach or the two-stage approach
is that the bound does not apply for $k = 0$ or $k = n$. The only uniform
bound that we can give on the extra redundancy is $\log n$, which
we can obtain by using the bounds of (11.40). The problem is that
we are not assigning enough probability to sequences with $k = 0$ or
$k = n$. If instead of using a uniform distribution on $\theta$, we used the
Dirichlet($\frac{1}{2}, \frac{1}{2}$) distribution, also called the Beta($\frac{1}{2}, \frac{1}{2}$) distribution, the
probability of a sequence $x_1, x_2, \ldots, x_n$ becomes

$$q_{\frac{1}{2}}(x^n) = \int_0^1 \theta^k (1 - \theta)^{n-k} \frac{1}{\pi \sqrt{\theta(1 - \theta)}} \, d\theta \tag{13.45}$$

and it can be shown that this achieves a description length

$$\log \frac{1}{q_{\frac{1}{2}}(x^n)} \le nH(k/n) + \frac{1}{2} \log n + \log \frac{\pi}{8} \tag{13.46}$$

for all $x^n \in \{0, 1\}^n$, achieving a uniform bound on the redundancy of the
universal mixture code. As in the case of the uniform prior, we can calculate
the conditional distribution of the next symbol, given the previous
observations, as

$$q_{\frac{1}{2}}(x_{i+1} = 1 \mid x^i) = \frac{k_i + \frac{1}{2}}{i + 1}, \tag{13.47}$$

which can be used with arithmetic coding to provide an online algorithm
to encode the sequence. We will analyze the performance of the mixture
algorithm in greater detail when we analyze universal portfolios in
Section 16.7.
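The difference between the two priors shows up most clearly on constant sequences. The sketch below treats both estimators as special cases of a hypothetical "add-c" rule, with c = 1 giving (13.44) and c = 1/2 giving (13.47); the function name and the choice of test sequence are illustrative.

```python
from fractions import Fraction
from math import log2


def sequential_probability(bits, c):
    """Product of conditional probabilities (k_i + c) / (i + 2c) along the sequence.

    c = 1 is the Laplace estimate (uniform prior);
    c = 1/2 is the estimate for the Beta(1/2, 1/2) prior.
    """
    c = Fraction(c)
    q, ones = Fraction(1), 0
    for i, b in enumerate(bits):
        p_one = (ones + c) / (i + 2 * c)
        q *= p_one if b == 1 else 1 - p_one
        ones += b
    return q


zeros = [0] * 64
laplace_bits = -log2(sequential_probability(zeros, 1))              # log(n + 1)
beta_half_bits = -log2(sequential_probability(zeros, Fraction(1, 2)))
```

On the all-zeros sequence the uniform prior costs $\log(n+1)$ bits, while the Beta($\frac{1}{2}, \frac{1}{2}$) prior assigns more probability to the extreme sequences and so yields a shorter description.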


13.3 ARITHMETIC CODING 

The Huffman coding procedure described in Chapter 5 is optimal for
encoding a random variable with a known distribution that has to be
encoded symbol by symbol. However, because the codeword
lengths for a Huffman code are restricted to be integral, there can be a
loss of up to 1 bit per symbol in coding efficiency. We could alleviate this
loss by using blocks of input symbols; however, the complexity of this
approach increases exponentially with block length. We now describe a
method of encoding without this inefficiency. In arithmetic coding, instead
of using a sequence of bits to represent a symbol, we represent it by a
subinterval of the unit interval.

The code for a sequence of symbols is an interval whose length decreases
as we add more symbols to the sequence. This property allows us to have a
coding scheme that is incremental (the code for an extension to a sequence
can be calculated simply from the code for the original sequence) and for
which the codeword lengths are not restricted to be integral. The motivation
for arithmetic coding is based on Shannon-Fano-Elias coding (Section 5.9)
and the following lemma:

Lemma 13.3.1 Let Y be a random variable with continuous probability
distribution function $F(y)$. Let $U = F(Y)$ (i.e., U is a function of Y defined
by its distribution function). Then U is uniformly distributed on [0, 1].

Proof: Since $F(y) \in [0, 1]$, the range of U is [0, 1]. Also, for $u \in [0, 1]$,

\begin{align}
F_U(u) &= \Pr(U \le u) \tag{13.48} \\
&= \Pr(F(Y) \le u) \tag{13.49} \\
&= \Pr(Y \le F^{-1}(u)) \tag{13.50} \\
&= F(F^{-1}(u)) \tag{13.51} \\
&= u, \tag{13.52}
\end{align}

which proves that U has a uniform distribution in [0, 1]. □
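The lemma can be illustrated empirically. The sketch below is an assumption-laden demonstration, not part of the proof: it picks an exponential distribution with unit rate for Y, uses a fixed seed for reproducibility, and compares the empirical distribution function of $U = F(Y)$ with that of a uniform variable.

```python
import random
from math import exp


def exponential_cdf(y):
    """Distribution function F(y) = 1 - e^{-y} of a unit-rate exponential."""
    return 1.0 - exp(-y)


def empirical_cdf(samples, u):
    """Fraction of samples that are <= u."""
    return sum(1 for s in samples if s <= u) / len(samples)


rng = random.Random(0)                               # fixed seed, illustrative
ys = [rng.expovariate(1.0) for _ in range(20000)]    # samples of Y
us = [exponential_cdf(y) for y in ys]                # samples of U = F(Y)
```

For a uniform variable, `empirical_cdf(us, u)` should be close to `u` for every `u` in [0, 1], up to sampling fluctuations of order $1/\sqrt{20000}$.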


Now consider an infinite sequence of random variables $X_1, X_2, \ldots$ from
a finite alphabet $\mathcal{X} = \{0, 1, 2, \ldots, m\}$. For any sequence $x_1, x_2, \ldots$, from
this alphabet, we can place 0. in front of the sequence and consider it as
a real number (base $m + 1$) between 0 and 1. Let X be the real-valued
random variable $X = 0.X_1 X_2 \ldots$. Then X has the following distribution
function:

\begin{align}
F_X(x) &= \Pr\{X \le x = 0.x_1 x_2 \cdots\} \tag{13.53} \\
&= \Pr\{0.X_1 X_2 \cdots \le 0.x_1 x_2 \cdots\} \tag{13.54} \\
&= \Pr\{X_1 < x_1\} + \Pr\{X_1 = x_1, X_2 < x_2\} + \cdots. \tag{13.55}
\end{align}

Now let $U = F_X(X) = F_X(0.X_1 X_2 \ldots) = 0.F_1 F_2 \ldots$. If the distribution
on infinite sequences $X^\infty$ has no atoms, then, by the lemma above, U has a
uniform distribution on [0, 1], and therefore the bits $F_1 F_2 \ldots$ in the binary
expansion of U are Bernoulli($\frac{1}{2}$) (i.e., they are independent and uniformly
distributed on {0, 1}). These bits are therefore incompressible, and form a
compressed representation of the sequence $0.X_1 X_2 \ldots$. For Bernoulli or
Markov models, it is easy to calculate the cumulative distribution function,
as illustrated in the following example.

Example 13.3.1 Let $X_1, X_2, \ldots, X_n$ be Bernoulli(p). Then the sequence
$x^n = 110101$ maps into

\begin{align}
F(x^n) &= \Pr(X_1 < 1) + \Pr(X_1 = 1, X_2 < 1) \notag \\
&\quad + \Pr(X_1 = 1, X_2 = 1, X_3 < 0) \notag \\
&\quad + \Pr(X_1 = 1, X_2 = 1, X_3 = 0, X_4 < 1) \notag \\
&\quad + \Pr(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 1, X_5 < 0) \notag \\
&\quad + \Pr(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 1, X_5 = 0, X_6 < 1) \tag{13.56} \\
&= q + pq + p^2 \cdot 0 + p^2 q \cdot q + p^2 q p \cdot 0 + p^2 q p q \cdot q \tag{13.57} \\
&= q + pq + p^2 q^2 + p^3 q^3. \tag{13.58}
\end{align}

Note that each term is easily computed from the previous terms. In general,
for an arbitrary binary process $\{X_i\}$,

$$F(x^n) = \sum_{k=1}^{n} p(x^{k-1} 0) \, x_k. \tag{13.59}$$
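Formula (13.59) translates directly into a running sum. The following sketch specializes it to an i.i.d. Bernoulli(p) source, as in the example; the function name and the value $p = \frac{1}{3}$ are illustrative, and $q = 1 - p$ is the probability of a 0.

```python
from fractions import Fraction


def sequence_cdf(bits, p):
    """F(x^n) = sum_k p(x^{k-1} 0) x_k for i.i.d. Bernoulli(p) bits."""
    q = 1 - p
    total = 0
    prefix_prob = 1                       # p(x^{k-1}), probability of the prefix so far
    for b in bits:
        if b == 1:
            total += prefix_prob * q      # prefix followed by a 0 lies below x^n
        prefix_prob *= p if b == 1 else q
    return total


p = Fraction(1, 3)                        # hypothetical Bernoulli parameter
q = 1 - p
value = sequence_cdf([1, 1, 0, 1, 0, 1], p)
```

Each term reuses the prefix probability from the previous step, which is the sense in which "each term is easily computed from the previous terms."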


The probability transform thus forms an invertible mapping from infinite
source sequences to incompressible infinite binary sequences. We
now consider the compression achieved by this transformation on finite
sequences. Let $X_1, X_2, \ldots, X_n$ be a sequence of binary random variables
of length n, and let $x_1, x_2, \ldots, x_n$ be a particular outcome. We
can treat this sequence as representing an interval $[0.x_1 x_2 \ldots x_n 000\ldots,\ 0.x_1 x_2 \ldots x_n 1111\ldots)$,
or equivalently, $[0.x_1 x_2 \ldots x_n,\ 0.x_1 x_2 \ldots x_n + (\frac{1}{2})^n)$.
This is the set of infinite sequences that start with $0.x_1 x_2 \cdots x_n$.
Under the probability transform, this interval gets mapped into another
interval, $[F_Y(0.x_1 x_2 \cdots x_n),\ F_Y(0.x_1 x_2 \cdots x_n + (\frac{1}{2})^n))$, whose length is
equal to $P_X(x_1, x_2, \ldots, x_n)$, the sum of the probabilities of all infinite
sequences that start with $0.x_1 x_2 \cdots x_n$. Under the probability inverse
transform, any real number u within this interval maps into a sequence that
starts with $x_1, x_2, \ldots, x_n$, and therefore given u and n, we can reconstruct
$x_1, x_2, \ldots, x_n$. The Shannon-Fano-Elias coding scheme described
earlier allows one to construct a prefix-free code of length
$\log \frac{1}{p(x_1, x_2, \ldots, x_n)} + 2$ bits, and therefore it is possible to encode the sequence
$x_1, x_2, \ldots, x_n$ with this length. Note that $\log \frac{1}{p(x_1, \ldots, x_n)}$ is the ideal
codeword length for $x^n$.

The process of encoding the sequence with the cumulative distribution
function described above assumes arbitrary accuracy for the computation.
In practice, though, we have to implement all numbers with finite
precision, and we describe such an implementation. The key is to consider
not infinite-precision points for the cumulative distribution function but
intervals in the unit interval. Any finite-length sequence of symbols can
be said to correspond to a subinterval of the unit interval. The objective
of the arithmetic coding algorithm is to represent a sequence of random
variables by a subinterval in [0, 1]. As the algorithm observes more input
symbols, the length of the subinterval corresponding to the input sequence
decreases. As the top end of the interval and the bottom end of the interval
get closer, they begin to agree in the first few bits. These will be the first
few bits of the output sequence. As soon as the two ends of the interval
agree, we can output the corresponding bits. We can therefore shift these
bits out of the calculation and effectively scale the remaining intervals so
that the entire calculation can be done with finite precision. We will not go
into the details here; there is a very good description of the algorithm
and performance considerations in Bell et al. [41].

Example 13.3.2 (Arithmetic coding for a ternary input alphabet) Consider
a random variable X with a ternary alphabet {A, B, C}, whose symbols are
assumed to have probabilities 0.4, 0.4, and 0.2, respectively. Let the
sequence to be encoded be ACAA. Thus, $F_l(\cdot) = (0, 0.4, 0.8)$ and
$F_h(\cdot) = (0.4, 0.8, 1.0)$. Initially, the input sequence is empty, and
the corresponding interval is [0, 1). The cumulative distribution function
after the first input symbol is shown in Figure 13.2. It is easy to
calculate that the interval in the algorithm without scaling after the
first symbol A is [0, 0.4); after the second symbol, C, it is [0.32, 0.4)
(Figure 13.3); after the third symbol A, it is [0.32, 0.352); and after the
fourth symbol A, it is [0.32, 0.3328). Since the probability of this
sequence is 0.0128, we will use $\lceil \log(1/0.0128) \rceil + 2$ (i.e.,
9 bits) to encode the midpoint of the interval sequence using
Shannon-Fano-Elias coding (0.3264, which is 0.010100111 binary).

[FIGURE 13.2. Cumulative distribution function after the first symbol; axes $x^n$ and $F(x^n)$.]

[FIGURE 13.3. Cumulative distribution function after the second symbol; axes $x^n$ and $F(x^n)$.]
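The interval arithmetic of Example 13.3.2 can be reproduced with a few lines of code. This is a minimal sketch of the unscaled (infinite-precision) procedure only, not of a practical finite-precision arithmetic coder; the helper names `narrow_interval` and `binary_expansion` are hypothetical, and exact fractions stand in for arbitrary-accuracy reals.

```python
from fractions import Fraction
import math

def narrow_interval(sequence, probs):
    """Return the [low, high) interval for `sequence`, without rescaling."""
    # Lower cumulative edge F_l of each symbol, in the order given by `probs`.
    lower, acc = {}, Fraction(0)
    for sym, pr in probs.items():
        lower[sym] = acc
        acc += pr
    low, width = Fraction(0), Fraction(1)
    for sym in sequence:
        low += width * lower[sym]      # move to the symbol's sub-interval
        width *= probs[sym]            # width equals the probability so far
    return low, low + width

def binary_expansion(x, nbits):
    """First `nbits` bits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(nbits):
        x *= 2
        bit = int(x >= 1)
        bits.append(str(bit))
        x -= bit
    return "".join(bits)

probs = {"A": Fraction(2, 5), "B": Fraction(2, 5), "C": Fraction(1, 5)}
low, high = narrow_interval("ACAA", probs)
nbits = math.ceil(math.log2(1 / (high - low)) + 2)   # log(1/p) + 2, rounded up
codeword = binary_expansion((low + high) / 2, nbits)
```

With these inputs the interval is [0.32, 0.3328), its width is 0.0128, and the 9-bit truncation of the midpoint 0.3264 is 010100111, matching the example.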

In summary, the arithmetic coding procedure, given any length n and
probability mass function $q(x_1x_2\cdots x_n)$, enables one to encode the
sequence $x_1x_2\cdots x_n$ in a code of length
$\log \frac{1}{q(x_1x_2\cdots x_n)} + 2$ bits. If the source is i.i.d.
and the assumed distribution q is equal to the true distribution p of the
data, this procedure achieves an average length for the block that is
within 2 bits of the entropy. Although this is not necessarily optimal for
any fixed block length (a Huffman code designed for the distribution could
have a lower average codeword length), the procedure is incremental and can
be used for any blocklength.

13.4 LEMPEL-ZIV CODING 

In Section 13.3 we discussed the basic ideas of arithmetic coding and
mentioned some results on worst-case redundancy for coding a sequence
from an unknown distribution. We now discuss a popular class of techniques
for source coding that are universally optimal (their asymptotic
compression rate approaches the entropy rate of the source for any
stationary ergodic source) and simple to implement. This class of
algorithms is termed Lempel-Ziv, named after the authors of two seminal
papers [603, 604] that describe the two basic algorithms that underlie this
class. The algorithms could also be described as adaptive dictionary
compression algorithms.

The notion of using dictionaries for compression dates back to the 
invention of the telegraph. At the time, companies were charged by the 
number of letters used, and many large companies produced codebooks for 
the frequently used phrases and used the codewords for their telegraphic 
communication. Another example is the notion of greetings telegrams
that are popular in India: there is a set of standard greetings such as
“25:Merry Christmas” and “26:May Heaven’s choicest blessings be showered
on the newly married couple.” A person wishing to send a greeting
only needs to specify the number, which is used to generate the actual
greeting at the destination.

The idea of adaptive dictionary-based schemes was not explored until
Ziv and Lempel wrote their papers in 1977 and 1978. The two papers
describe two distinct versions of the algorithm. We refer to these
versions as LZ77 or sliding window Lempel-Ziv and LZ78 or tree-structured
Lempel-Ziv. (They are sometimes called LZ1 and LZ2, respectively.)

We first describe the basic algorithms in the two cases and describe
some simple variations. We later prove their optimality, and end with
some practical issues. The key idea of the Lempel-Ziv algorithm is to
parse the string into phrases and to replace phrases by pointers to where
the same string has occurred in the past. The algorithms differ in the
set of possible match locations (and match lengths) that each allows.

13.4.1 Sliding Window Lempel-Ziv Algorithm 

The algorithm described in the 1977 paper encodes a string by finding the
longest match anywhere within a window of past symbols and represents
the string by a pointer to the location of the match within the window and
the length of the match. There are many variations of this basic algorithm,
and we describe one due to Storer and Szymanski [507].

We assume that we have a string $x_1, x_2, \ldots$ to be compressed from a
finite alphabet. A parsing S of a string $x_1x_2\cdots x_n$ is a division
of the string into phrases, separated by commas. Let W be the length of the
window. Then the algorithm can be described as follows: Assume that
we have compressed the string until time $i - 1$. Then to find the next
phrase, find the largest k such that for some j, $i - 1 - W \le j \le i - 1$,
the string of length k starting at $x_j$ is equal to the string (of length
k) starting at $x_i$ (i.e., $x_{j+l} = x_{i+l}$ for all $0 \le l < k$). The
next phrase is then of length k (i.e., $x_i \cdots x_{i+k-1}$) and is
represented by the pair (P, L), where P is the location of the beginning of
the match and L is the length of the match. If a match is not found in the
window, the next character is sent uncompressed. To distinguish between
these two cases, a flag bit is needed, and hence the phrases are of two
types: (F, P, L) or (F, C), where C represents an uncompressed character.

Note that the target of a (pointer,length) pair could extend beyond the 
window, so that it overlaps with the new phrase. In theory, this match 
could be arbitrarily long; in practice, though, the maximum phrase length 
is restricted to be less than some parameter. 

For example, if W = 4 and the string is ABBABBABBBAABABA
and the initial window is empty, the string will be parsed as follows:
A,B,B,ABBABB,BA,A,BA,BA, which is represented by the sequence of
“pointers”: (0,A),(0,B),(1,1,1),(1,3,6),(1,4,2),(1,1,1),(1,3,2),(1,2,2), where
the flag bit is 0 if there is no match and 1 if there is a match, and the
location of the match is measured backward from the end of the window.
[In the example, we have represented every match within the window
using the (P, L) pair; however, it might be more efficient to represent
short matches as uncompressed characters. See Problem 13.8 for details.]
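The parsing rule above can be sketched in code. The functions `lz77_parse` and `lz77_decode` are hypothetical names, and the sketch makes two assumptions that the text leaves open: match starts are searched over the last W positions only, and ties between equally long matches go to the nearest one. Matches are allowed to run past the window and overlap the phrase being encoded, as noted earlier.

```python
def lz77_parse(s, window):
    """Parse s into (0, char) and (1, distance, length) phrases."""
    phrases, i = [], 0
    while i < len(s):
        best_len, best_dist = 0, 0
        # Candidate match starts lie in the last `window` positions before i.
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            k = 0
            # The match may run past position i and overlap the new phrase.
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:           # strict: ties keep the nearest match
                best_len, best_dist = k, i - j
        if best_len == 0:
            phrases.append((0, s[i]))  # no match: send the character itself
            i += 1
        else:
            phrases.append((1, best_dist, best_len))
            i += best_len
    return phrases

def lz77_decode(phrases):
    """Invert lz77_parse; copying one symbol at a time handles overlap."""
    out = []
    for ph in phrases:
        if ph[0] == 0:
            out.append(ph[1])
        else:
            _, dist, length = ph
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)

parsed = lz77_parse("ABBABBABBBAABABA", 4)
```

On the string of the example with W = 4, this sketch produces the same eight phrases and pointers as listed above, and decoding them recovers the original string.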

We can view this algorithm as using a dictionary that consists of all 
substrings of the string in the window and of all single characters. The 
algorithm finds the longest match within the dictionary and sends a pointer 
to that match. We later show that a simple variation on this version of 
LZ77 is asymptotically optimal. Most practical implementations of LZ77, 
such as gzip and pkzip, are also based on this version of LZ77. 


13.4.2 Tree-Structured Lempel-Ziv Algorithms 

In the 1978 paper, Ziv and Lempel described an algorithm that parses a 
string into phrases, where each phrase is the shortest phrase not seen earlier.
This algorithm can be viewed as building a dictionary in the form of
a tree, where the nodes correspond to phrases seen so far. The algorithm is 
particularly simple to implement and has become popular as one of the early 
standard algorithms for file compression on computers because of its speed 
and efficiency. It is also used for data compression in high-speed modems. 

The source sequence is sequentially parsed into strings that have not
appeared so far. For example, if the string is ABBABBABBBAABABAA...,
we parse it as A,B,BA,BB,AB,BBA,ABA,BAA,.... After every comma, we look
along the input sequence until we come to the shortest string that has not
been marked off before. Since this is the shortest such string, all its
prefixes must have occurred earlier. (Thus, we can build up a tree of these
phrases.) In particular, the string consisting of all but the last bit
of this string must have occurred earlier. We code this phrase by giving
the location of the prefix and the value of the last symbol. Thus, the
string above would be represented as
(0,A),(0,B),(2,A),(2,B),(1,B),(4,A),(5,A),(3,A),....
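The incremental parsing just described can be sketched as follows. The function name `lz78_parse` is hypothetical; the dictionary index 0 is assumed to denote the empty prefix, so that a one-symbol phrase is coded as (0, symbol) as in the example.

```python
def lz78_parse(s):
    """Return (phrases, pairs, tail) for the tree-structured parsing of s."""
    index = {"": 0}                      # phrase -> index; 0 is the empty prefix
    phrases, pairs = [], []
    current = ""
    for ch in s:
        candidate = current + ch
        if candidate in index:
            current = candidate          # keep extending until the phrase is new
        else:
            index[candidate] = len(index)
            phrases.append(candidate)
            pairs.append((index[current], ch))   # (pointer to prefix, last symbol)
            current = ""
    return phrases, pairs, current       # `current` holds any unfinished tail

phrases, pairs, tail = lz78_parse("ABBABBABBBAABABAA")
```

For the string of the example the sketch yields the phrases A, B, BA, BB, AB, BBA, ABA, BAA with the pointer pairs listed above and an empty tail. How an unfinished final phrase is transmitted is an implementation detail not addressed here.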

Sending an uncompressed character in each phrase results in a loss of 
efficiency. It is possible to get around this by considering the extension 
character (the last character of the current phrase) as part of the next 
phrase. This variation, due to Welch [554], is the basis of most practical
implementations of LZ78, such as compress on Unix, compression in
modems, and image files in the GIF format.

13.5 OPTIMALITY OF LEMPEL-ZIV ALGORITHMS 

13.5.1 Sliding Window Lempel-Ziv Algorithms 

In the original paper of Ziv and Lempel [603], the authors described the 
basic LZ77 algorithm and proved that it compressed any string as well 
as any finite-state compressor acting on that string. However, they did 
not prove that this algorithm achieved asymptotic optimality (i.e., that the
compression ratio converged to the entropy for an ergodic source). This
result was proved by Wyner and Ziv [591]. 

The proof relies on a simple lemma due to Kac: the average length of 
time that you need to wait to see a particular symbol is the reciprocal of 
the probability of a symbol. Thus, we are likely to see the high-probability 
strings within the window and encode these strings efficiently. The strings 
that we do not find within the window have low probability, so that 
asymptotically, they do not influence the compression achieved. 

Instead of proving the optimality of the practical version of LZ77, we 
will present a simpler proof for a different version of the algorithm, which, 
though not practical, captures some of the basic ideas. This algorithm 
assumes that both the sender and receiver have access to the infinite past 
of the string, and represents a string of length n by pointing to the last 
time it occurred in the past. 

We assume that we have a stationary and ergodic process defined
for time from $-\infty$ to $\infty$, and that both the encoder and decoder
have access to $\ldots, X_{-2}, X_{-1}$, the infinite past of the sequence.
Then to encode $X_0, X_1, \ldots, X_{n-1}$ (a block of length n), we find
the last time we have seen these n symbols in the past. Let

$$R_n(X_0, X_1, \ldots, X_{n-1}) = \max\{j < 0 : (X_{-j}, X_{-j+1}, \ldots, X_{-j+n-1}) = (X_0, \ldots, X_{n-1})\}. \tag{13.60}$$

Then to represent $X_0, \ldots, X_{n-1}$, we need only to send $R_n$ to the
receiver, who can then look back $R_n$ bits into the past and recover
$X_0, \ldots, X_{n-1}$. Thus, the cost of the encoding is the cost of
representing $R_n$. We will show that this cost is approximately
$\log R_n$ and that asymptotically $\frac{1}{n} E \log R_n \to H(\mathcal{X})$,
thus proving the asymptotic optimality of this algorithm.

We will need the following lemmas. 

Lemma 13.5.1 There exists a prefix-free code for the integers such that
the length of the codeword for integer k is $\log k + 2 \log\log k + O(1)$.

Proof: If we knew that $k \le m$, we could encode k with $\log m$ bits.
However, since we do not have an upper limit for k, we need to tell the
receiver the length of the encoding of k (i.e., we need to specify
$\log k$). Consider the following encoding for the integer k: We first
represent $\lceil \log k \rceil$ in unary, followed by the binary
representation of k:

$$C_1(k) = \underbrace{00\cdots0}_{\lceil \log k \rceil\ 0\text{'s}}\ 1\ \underbrace{xx\cdots x}_{k \text{ in binary}}. \tag{13.61}$$

It is easy to see that the length of this representation is
$2\lceil \log k \rceil + 1 \le 2\log k + 3$. This is more than the length
we are looking for since we are using the very inefficient unary code to
send $\log k$. However, if we use $C_1$ to represent $\log k$, it is now
easy to see that this representation has a length less than
$\log k + 2\log\log k + 4$, which proves the lemma. A similar method
is presented in the discussion following Theorem 14.2.3. □
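The two-stage construction in the proof can be illustrated with the Elias gamma and delta codes, which are standard codes of the same flavor but are swapped in here as an assumption: they are not claimed to be identical bit for bit to the code $C_1$ of (13.61). The gamma code sends the length of k in unary followed by k in binary; the delta code uses the gamma code for the length instead, giving roughly log k + 2 log log k bits.

```python
import math

def elias_gamma(k):
    """Unary length prefix followed by k in binary (k >= 1)."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(k):
    """Gamma-code the bit length of k, then send k without its leading 1."""
    b = bin(k)[2:]
    return elias_gamma(len(b)) + b[1:]

def decode_delta(bits, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    nbits = int(bits[pos + zeros: pos + 2 * zeros + 1], 2)  # bit length of k
    pos += 2 * zeros + 1
    value = int("1" + bits[pos: pos + nbits - 1], 2)
    return value, pos + nbits - 1

# Prefix-freeness in action: concatenated codewords decode without separators.
stream = "".join(elias_delta(k) for k in [1, 10, 255, 4096])
decoded, pos = [], 0
while pos < len(stream):
    value, pos = decode_delta(stream, pos)
    decoded.append(value)
```

Because no codeword is a prefix of another, the decoder needs no delimiters between integers, which is what allows the recurrence time $R_n$ to be sent as a self-delimiting description.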

The key result that underlies the proof of the optimality of LZ77 is
Kac’s lemma, which relates the average recurrence time to the probability
of a symbol for any stationary ergodic process. For example, if
$X_1, X_2, \ldots, X_n$ is an i.i.d. process, we ask what is the expected
waiting time to see the symbol a again, conditioned on the fact that
$X_1 = a$. In this case, the waiting time has a geometric distribution with
parameter $p = p(X_0 = a)$, and thus the expected waiting time is
$1/p(X_0 = a)$. The somewhat surprising result is that the same is true
even if the process is not i.i.d., but stationary and ergodic. A simple
intuitive reason for this is that in a long sample of length n, we would
expect to see a about $np(a)$ times, and the average distance between these
occurrences of a is $n/(np(a))$ (i.e., $1/p(a)$).

Lemma 13.5.2 (Kac) Let $\ldots, U_{-2}, U_{-1}, U_0, U_1, \ldots$ be a
stationary ergodic process on a countable alphabet. For any u such that
$p(u) > 0$ and for $i = 1, 2, \ldots$, let

$$Q_u(i) = \Pr\{U_{-i} = u;\ U_j \ne u \text{ for } -i < j < 0 \mid U_0 = u\} \tag{13.62}$$

[i.e., $Q_u(i)$ is the conditional probability that the most recent previous
occurrence of the symbol u is i, given that $U_0 = u$]. Then

$$E(R_1(U) \mid U_0 = u) = \sum_i i\, Q_u(i) = \frac{1}{p(u)}. \tag{13.63}$$

Thus, the conditional expected waiting time to see the symbol u again,
looking backward from zero, is $1/p(u)$.

Note the amusing fact that the expected recurrence time

$$E R_1(U) = \sum_u p(u) \frac{1}{p(u)} = m, \tag{13.64}$$

where m is the alphabet size.

Proof: Let $U_0 = u$. Define the events for $j = 1, 2, \ldots$ and
$k = 0, 1, 2, \ldots$:

$$A_{jk} = \{U_{-j} = u,\ U_l \ne u,\ -j < l < k,\ U_k = u\}. \tag{13.65}$$

Event $A_{jk}$ corresponds to the event where the last time before zero at
which the process is equal to u is at $-j$, the first time after zero at
which the process equals u is k. These events are disjoint, and by
ergodicity, the probability $\Pr\{\cup_{j,k} A_{jk}\} = 1$. Thus,

$$1 = \Pr\{\cup_{j,k} A_{jk}\} \tag{13.66}$$

$$\overset{(a)}{=} \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \Pr\{A_{jk}\} \tag{13.67}$$

$$= \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \Pr(U_k = u)\, \Pr\{U_{-j} = u,\ U_l \ne u,\ -j < l < k \mid U_k = u\} \tag{13.68}$$

$$\overset{(b)}{=} \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \Pr(U_k = u)\, Q_u(j + k) \tag{13.69}$$

$$\overset{(c)}{=} \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \Pr(U_0 = u)\, Q_u(j + k) \tag{13.70}$$

$$= \Pr(U_0 = u) \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} Q_u(j + k) \tag{13.71}$$

$$\overset{(d)}{=} \Pr(U_0 = u) \sum_{i=1}^{\infty} i\, Q_u(i), \tag{13.72}$$

where (a) follows from the fact that the $A_{jk}$ are disjoint, (b) follows
from the definition of $Q_u(\cdot)$, (c) follows from stationarity, and (d)
follows from the fact that there are i pairs (j, k) such that $j + k = i$
in the sum. Kac’s lemma follows directly from this equation. □
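Kac’s lemma can be checked exactly on a small non-i.i.d. example. The sketch below assumes a two-state Markov chain with transition probabilities a = Pr(0 to 1) and b = Pr(1 to 0); the chain, the values a = 1/4 and b = 1/2, and the name `q0` are illustrative assumptions. A two-state chain is reversible, so the backward recurrence distribution $Q_0(i)$ coincides with the forward first-return distribution used here.

```python
from fractions import Fraction

def q0(i, a, b):
    """Q_0(i): probability that the recurrence time of state 0 equals i."""
    # Either stay in 0 (i = 1), or leave to 1, stay there i - 2 steps, return.
    return 1 - a if i == 1 else a * (1 - b) ** (i - 2) * b

a, b = Fraction(1, 4), Fraction(1, 2)     # assumed transition probabilities
pi0 = b / (a + b)                         # stationary probability of state 0

# Closed form of sum_i i * Q_0(i): (1 - a) * 1 + a * (1 + 1/b) = 1 + a/b.
closed_form = 1 + a / b

# Truncated version of the same sum; the neglected tail is geometrically small.
partial_mean = sum(i * q0(i, a, b) for i in range(1, 200))
```

Here the stationary probability of state 0 is 2/3 and the mean recurrence time is 3/2, its reciprocal, even though successive symbols are dependent.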


Corollary Let $\ldots, X_{-1}, X_0, X_1, \ldots$ be a stationary ergodic
process and let $R_n(X_0, \ldots, X_{n-1})$ be the recurrence time looking
backward as defined in (13.60). Then

$$E\left[R_n(X_0, \ldots, X_{n-1}) \mid (X_0, \ldots, X_{n-1}) = x_0^{n-1}\right] = \frac{1}{p(x_0^{n-1})}. \tag{13.73}$$

Proof: Define a new process with $U_i = (X_i, X_{i+1}, \ldots, X_{i+n-1})$.
The U process is also stationary and ergodic, and thus by Kac’s lemma the
average recurrence time for U conditioned on $U_0 = u$ is $1/p(u)$.
Translating this to the X process proves the corollary. □

We are now in a position to prove the main result, which shows that
the compression ratio for the simple version of Lempel-Ziv using
recurrence time approaches the entropy. The algorithm describes
$X_0^{n-1}$ by describing $R_n(X_0^{n-1})$, which by Lemma 13.5.1 can be
done with $\log R_n + 2 \log\log R_n + 4$ bits. We now prove the following
theorem.

Theorem 13.5.1 Let $L_n(X_0^{n-1}) = \log R_n + 2 \log\log R_n + O(1)$ be
the description length for $X_0^{n-1}$ in the simple algorithm described
above. Then

$$\frac{1}{n} E L_n(X_0^{n-1}) \to H(\mathcal{X}) \tag{13.74}$$

as $n \to \infty$, where $H(\mathcal{X})$ is the entropy rate of the
process $\{X_i\}$.

Proof: We will prove upper and lower bounds for $E L_n$. The lower bound
follows directly from standard source coding results (i.e.,
$E L_n \ge nH$ for any prefix-free code). To prove the upper bound, we
first show that

$$\limsup \frac{1}{n} E \log R_n \le H \tag{13.75}$$

and later bound the other terms in the expression for $L_n$. To prove the
bound for $E \log R_n$, we expand the expectation by conditioning on the
value of $X_0^{n-1}$ and then applying Jensen’s inequality. Thus,

$$\frac{1}{n} E \log R_n = \frac{1}{n} \sum_{x_0^{n-1}} p(x_0^{n-1})\, E\left[\log R_n(X_0^{n-1}) \mid X_0^{n-1} = x_0^{n-1}\right] \tag{13.76}$$

$$\le \frac{1}{n} \sum_{x_0^{n-1}} p(x_0^{n-1}) \log E\left[R_n(X_0^{n-1}) \mid X_0^{n-1} = x_0^{n-1}\right] \tag{13.77}$$

$$= \frac{1}{n} \sum_{x_0^{n-1}} p(x_0^{n-1}) \log \frac{1}{p(x_0^{n-1})} \tag{13.78}$$

$$= \frac{1}{n} H(X_0^{n-1}) \tag{13.79}$$

$$\to H(\mathcal{X}). \tag{13.80}$$

The second term in the expression for $L_n$ is $\log\log R_n$, and we wish
to show that

$$\frac{1}{n} E\left[\log\log R_n(X_0^{n-1})\right] \to 0. \tag{13.81}$$

Again, we use Jensen’s inequality,

$$\frac{1}{n} E \log\log R_n \le \frac{1}{n} \log E\left[\log R_n(X_0^{n-1})\right] \tag{13.82}$$

$$\le \frac{1}{n} \log H(X_0^{n-1}), \tag{13.83}$$

where the last inequality follows from (13.79). For any $\epsilon > 0$, for
large enough n, $H(X_0^{n-1}) < n(H + \epsilon)$, and therefore
$\frac{1}{n} \log\log R_n < \frac{1}{n} \log n + \frac{1}{n} \log(H + \epsilon) \to 0$.
This completes the proof of the theorem. □


Thus, a compression scheme that represents a string by encoding the
last time it was seen in the past is asymptotically optimal. Of course, this
scheme is not practical, since it assumes that both sender and receiver
have access to the infinite past of a sequence. For longer strings, one
would have to look further and further back into the past to find a match.
For example, if the entropy rate is $\frac{1}{2}$ and the string has length
200 bits, one would have to look an average of $2^{100} \approx 10^{30}$
bits into the past to find a match. Although this is not feasible, the
algorithm illustrates the basic idea that matching the past is
asymptotically optimal. The proof of the optimality of the practical
version of LZ77 with a finite window is based on similar ideas. We will not
present the details here, but refer the reader to the original proof in
[591].

13.5.2 Optimality of Tree-Structured Lempel-Ziv Compression 

We now consider the tree-structured version of Lempel-Ziv, where the 
input sequence is parsed into phrases, each phrase being the shortest string 
that has not been seen so far. The proof of the optimality of this algorithm 
has a very different flavor from the proof for LZ77; the essence of the 
proof is a counting argument that shows that the number of phrases cannot 
be too large if they are all distinct, and the probability of any sequence of 
symbols can be bounded by a function of the number of distinct phrases 
in the parsing of the sequence. 

The algorithm described in Section 13.4.2 requires two passes over the
string: in the first pass, we parse the string and calculate c(n), the
number of phrases in the parsed string. We then use that to decide how many
bits $\lceil \log c(n) \rceil$ to allot to the pointers in the algorithm.
In the second pass, we calculate the pointers and produce the coded string
as indicated above.
The algorithm can be modified so that it requires only one pass over the 
string and also uses fewer bits for the initial pointers. These modifications 
do not affect the asymptotic efficiency of the algorithm. Some of the 
implementation details are discussed by Welch [554] and Bell et al. [41]. 

We will show that like the sliding window version of Lempel-Ziv, 
this algorithm asymptotically achieves the entropy rate for the unknown 
ergodic source. We first define a parsing of the string to be a decomposition 
into phrases. 

Definition A parsing S of a binary string $x_1x_2\cdots x_n$ is a division
of the string into phrases, separated by commas. A distinct parsing is a
parsing such that no two phrases are identical. For example, 0,111,1 is a
distinct parsing of 01111, but 0,11,11 is a parsing that is not distinct.

The LZ78 algorithm described above gives a distinct parsing of the
source sequence. Let c(n) denote the number of phrases in the LZ78
parsing of a sequence of length n. Of course, c(n) depends on the sequence
$X^n$. The compressed sequence (after applying the Lempel-Ziv algorithm)
consists of a list of c(n) pairs of numbers, each pair consisting of a
pointer to the previous occurrence of the prefix of the phrase and the last
bit of the phrase. Each pointer requires $\log c(n)$ bits, and hence the
total length of the compressed sequence is $c(n)[\log c(n) + 1]$ bits. We
now show that $\frac{c(n)(\log c(n) + 1)}{n} \to H(\mathcal{X})$ for a
stationary ergodic sequence $X_1, X_2, \ldots, X_n$. Our proof is based on
the simple proof of asymptotic optimality of LZ78 coding due to Wyner and
Ziv [575].
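The description length c(n)[log c(n) + 1] is easy to evaluate for specific binary strings. The sketch below uses hypothetical helper names and rounds the pointer size up to a whole number of bits, which is an assumption beyond the formula in the text. It contrasts a constant string with a short string built by concatenating all binary words of length 1, 2, and 3.

```python
import math

def lz78_phrase_count(s):
    """c(n): number of complete phrases in the LZ78 parsing of s."""
    seen, current, count = set(), "", 0
    for ch in s:
        current += ch
        if current not in seen:        # shortest string not seen so far
            seen.add(current)
            count += 1
            current = ""
    return count

def lz78_length_bits(s):
    """c(n) * (ceil(log2 c(n)) + 1): pointer bits plus one bit per phrase."""
    c = lz78_phrase_count(s)
    return 0 if c == 0 else c * (math.ceil(math.log2(c)) + 1)

zeros = "0" * 5050                     # parses into phrases of length 1, 2, ..., 100
rate_zeros = lz78_length_bits(zeros) / len(zeros)

# Every binary word of length 1..3, in order: each one is a new phrase.
rich = "".join(f"{v:0{l}b}" for l in range(1, 4) for v in range(2 ** l))
rate_rich = lz78_length_bits(rich) / len(rich)
```

The constant string has only 100 phrases in 5050 symbols and costs about 0.16 bits per symbol, while the 34-symbol string with 14 distinct phrases costs more than 2 bits per symbol. The optimality result is asymptotic, so short strings can expand.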

Before we proceed to the details of the proof, we provide an outline 
of the main ideas. The first lemma shows that the number of phrases in 
a distinct parsing of a sequence is less than n/logn; the main argument 
in the proof is based on the fact that there are not enough distinct short 
phrases. This bound holds for any distinct parsing of the sequence, not 
just the LZ78 parsing. 

The second key idea is a bound on the probability of a sequence based
on the number of distinct phrases. To illustrate this, consider an i.i.d.
sequence of random variables $X_1, X_2, X_3, X_4$ that take on four possible
values, {A, B, C, D}, with probabilities $p_A$, $p_B$, $p_C$, and $p_D$,
respectively. Now consider the probability of a sequence
$P(D, A, B, C) = p_D p_A p_B p_C$. Since $p_A + p_B + p_C + p_D = 1$, the
product $p_D p_A p_B p_C$ is maximized when the probabilities are equal
(i.e., the maximum value of the probability of a sequence of four distinct
symbols is 1/256). On the other hand, if we consider a sequence A, B, A, B,
the probability of this sequence is maximized if $p_A = p_B = \frac{1}{2}$,
$p_C = p_D = 0$, and the maximum probability for A, B, A, B is
$\frac{1}{16}$. A sequence of the form A, A, A, A could have a probability
of 1. All these examples illustrate a basic point: sequences with a large
number of distinct symbols (or phrases) cannot have a large probability.
Ziv’s inequality (Lemma 13.5.5) is the extension of this idea to the Markov
case, where the distinct symbols are the phrases of the distinct parsing of
the source sequence.

Since the description length of a sequence after the parsing grows as
c log c, the sequences that have very few distinct phrases can be
compressed efficiently and correspond to strings that could have a high
probability. On the other hand, strings that have a large number of distinct
phrases do not compress as well; but the probability of these sequences
could not be too large by Ziv’s inequality. Thus, Ziv’s inequality enables
us to connect the logarithm of the probability of the sequence with the
number of phrases in its parsing, and this is finally used to show that the
tree-structured Lempel-Ziv algorithm is asymptotically optimal.

We first prove a few lemmas that we need for the proof of the theorem.
The first is a bound on the number of phrases possible in a distinct parsing
of a binary sequence of length n.

Lemma 13.5.3 (Lempel and Ziv [604]) The number of phrases c(n) in
a distinct parsing of a binary sequence $X_1, X_2, \ldots, X_n$ satisfies

$$c(n) \le \frac{n}{(1 - \epsilon_n) \log n}, \tag{13.84}$$

where $\epsilon_n = \min\left\{1, \frac{\log(\log n) + 4}{\log n}\right\} \to 0$
as $n \to \infty$.

Proof: Let

$$n_k = \sum_{j=1}^{k} j 2^j = (k - 1) 2^{k+1} + 2 \tag{13.85}$$

be the sum of the lengths of all distinct strings of length less than or
equal to k. The number of phrases c in a distinct parsing of a sequence of
length n is maximized when all the phrases are as short as possible. If
$n = n_k$, this occurs when all the phrases are of length $\le k$, and thus

$$c(n_k) \le \sum_{j=1}^{k} 2^j = 2^{k+1} - 2 < 2^{k+1} \le \frac{n_k}{k - 1}. \tag{13.86}$$

If $n_k \le n < n_{k+1}$, we write $n = n_k + \Delta$, where
$\Delta < (k + 1) 2^{k+1}$. Then the parsing into shortest phrases has each
of the phrases of length $\le k$ and $\Delta/(k + 1)$ phrases of length
$k + 1$. Thus,

$$c(n) \le \frac{n_k}{k - 1} + \frac{\Delta}{k + 1} \le \frac{n_k + \Delta}{k - 1} = \frac{n}{k - 1}. \tag{13.87}$$

We now bound the size of k for a given n. Let $n_k \le n < n_{k+1}$. Then

$$n \ge n_k = (k - 1) 2^{k+1} + 2 \ge 2^k, \tag{13.88}$$

and therefore

$$k \le \log n. \tag{13.89}$$

Moreover,

$$n \le n_{k+1} = k 2^{k+2} + 2 \le (k + 2) 2^{k+2} \le (\log n + 2) 2^{k+2}, \tag{13.90}$$

by (13.89), and therefore

$$k + 2 \ge \log \frac{n}{\log n + 2}, \tag{13.91}$$

or for all $n \ge 4$,

$$k - 1 \ge \log n - \log(\log n + 2) - 3 \tag{13.92}$$

$$= \left(1 - \frac{\log(\log n + 2) + 3}{\log n}\right) \log n \tag{13.93}$$

$$\ge \left(1 - \frac{\log(2 \log n) + 3}{\log n}\right) \log n \tag{13.94}$$

$$= \left(1 - \frac{\log(\log n) + 4}{\log n}\right) \log n \tag{13.95}$$

$$= (1 - \epsilon_n) \log n. \tag{13.96}$$

Note that $\epsilon_n = \min\left\{1, \frac{\log(\log n) + 4}{\log n}\right\}$.
Combining (13.96) with (13.87), we obtain the lemma. □
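The counting argument can be checked numerically. In the sketch below (hypothetical function names), `max_distinct_phrases` computes the counting upper bound used in the proof, namely the largest number of distinct binary strings whose total length does not exceed n, and `lemma_bound` evaluates the right-hand side of (13.84). The bound is reported as infinite when $\epsilon_n = 1$, where (13.84) carries no information.

```python
import math

def n_k(k):
    """Total length of all binary strings of length 1..k."""
    return sum(j * 2 ** j for j in range(1, k + 1))

def max_distinct_phrases(n):
    """Most distinct binary strings with total length at most n."""
    count, length = 0, 1
    while n >= length * 2 ** length:   # room for every string of this length
        count += 2 ** length
        n -= length * 2 ** length
        length += 1
    return count + n // length         # spend what is left on the next length

def lemma_bound(n):
    """Right-hand side of (13.84), for n >= 2."""
    log_n = math.log2(n)
    eps = min(1.0, (math.log2(log_n) + 4) / log_n)
    return math.inf if eps == 1.0 else n / ((1 - eps) * log_n)

checks = [(2 ** m, max_distinct_phrases(2 ** m), lemma_bound(2 ** m))
          for m in range(1, 21)]
```

For n = 256, for example, at most 61 distinct phrases fit, against a bound of 256; the bound is loose for small n and only becomes comparable to n/log n asymptotically.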


We will need a simple result on maximum entropy in the proof of the 
main theorem. 


Lemma 13.5.4 Let Z be a nonnegative integer-valued random variable
with mean $\mu$. Then the entropy H(Z) is bounded by

$$H(Z) \le (\mu + 1) \log(\mu + 1) - \mu \log \mu. \tag{13.97}$$

Proof: The lemma follows directly from the results of Theorem 12.1.1,
which show that the geometric distribution maximizes the entropy of a
nonnegative integer-valued random variable subject to a mean constraint. □
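The bound (13.97) can be compared numerically with the entropy of the geometric distribution that attains it. The sketch below assumes the parametrization $p(z) = (1 - r) r^z$ on z = 0, 1, 2, ... with $r = \mu/(\mu + 1)$, which has mean $\mu$; the function names and the truncation at 2000 terms are illustrative choices.

```python
import math

def max_entropy_bound(mu):
    """Right-hand side of (13.97), in bits, for mu > 0."""
    return (mu + 1) * math.log2(mu + 1) - mu * math.log2(mu)

def geometric_entropy(mu, terms=2000):
    """Entropy in bits of p(z) = (1 - r) r^z, z >= 0, with mean mu (truncated sum)."""
    r = mu / (mu + 1)
    h = 0.0
    for z in range(terms):
        pz = (1 - r) * r ** z
        if pz > 0:
            h -= pz * math.log2(pz)
    return h

bound = max_entropy_bound(3)
geom = geometric_entropy(3)
uniform = math.log2(7)                 # uniform on {0, ..., 6} also has mean 3
```

With mean 3 the geometric distribution meets the bound (about 3.245 bits), while the uniform distribution on {0, ..., 6}, which has the same mean, has a strictly smaller entropy of log 7, about 2.807 bits.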


Let $\{X_i\}_{i=-\infty}^{\infty}$ be a binary stationary ergodic process
with probability mass function $P(x_1, x_2, \ldots, x_n)$. (Ergodic
processes are discussed in greater detail in Section 16.8.) For a fixed
integer k, define the kth-order Markov approximation to P as

$$Q_k(x_{-(k-1)}, \ldots, x_0, x_1, \ldots, x_n) \triangleq P(x_{-(k-1)}^{0}) \prod_{j=1}^{n} P(x_j \mid x_{j-k}^{j-1}), \tag{13.98}$$

where $x_i^j \triangleq (x_i, x_{i+1}, \ldots, x_j)$, $i \le j$, and the
initial state $x_{-(k-1)}^{0}$ will be part of the specification of $Q_k$.
Since $P(X_n \mid X_{n-k}^{n-1})$ is itself an ergodic process, we have

$$-\frac{1}{n} \log Q_k(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}) = -\frac{1}{n} \sum_{j=1}^{n} \log P(X_j \mid X_{j-k}^{j-1}) \tag{13.99}$$

$$\to -E \log P(X_j \mid X_{j-k}^{j-1}) \tag{13.100}$$

$$= H(X_j \mid X_{j-k}^{j-1}). \tag{13.101}$$

We will bound the rate of the LZ78 code by the entropy rate of the
kth-order Markov approximation for all k. The entropy rate of the Markov
approximation $H(X_j \mid X_{j-k}^{j-1})$ converges to the entropy rate of
the process as $k \to \infty$, and this will prove the result.

Suppose that $X_{-(k-1)}^{n} = x_{-(k-1)}^{n}$, and suppose that $x_1^n$ is
parsed into c distinct phrases, $y_1, y_2, \ldots, y_c$. Let $\nu_i$ be the
index of the start of the ith phrase (i.e.,
$y_i = x_{\nu_i}^{\nu_{i+1} - 1}$). For each $i = 1, 2, \ldots, c$, define
$s_i = x_{\nu_i - k}^{\nu_i - 1}$. Thus, $s_i$ is the k bits of x preceding
$y_i$. Of course, $s_1 = x_{-(k-1)}^{0}$.

Let $c_{ls}$ be the number of phrases $y_i$ with length l and preceding
state $s_i = s$ for $l = 1, 2, \ldots$ and $s \in \mathcal{X}^k$. We then
have

$$\sum_{l,s} c_{ls} = c \tag{13.102}$$

and

$$\sum_{l,s} l\, c_{ls} = n. \tag{13.103}$$

We now prove a surprising upper bound on the probability of a string
based on the parsing of the string.

Lemma 13.5.5 (Ziv’s inequality) For any distinct parsing (in particular,
the LZ78 parsing) of the string $x_1x_2\cdots x_n$, we have

$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le -\sum_{l,s} c_{ls} \log c_{ls}. \tag{13.104}$$

Note that the right-hand side does not depend on $Q_k$.

Proof: We write

$$Q_k(x_1, x_2, \ldots, x_n \mid s_1) = Q_k(y_1, y_2, \ldots, y_c \mid s_1) \tag{13.105}$$

$$= \prod_{i=1}^{c} P(y_i \mid s_i) \tag{13.106}$$

or

$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) = \sum_{i=1}^{c} \log P(y_i \mid s_i) \tag{13.107}$$

$$= \sum_{l,s} \sum_{i : |y_i| = l,\, s_i = s} \log P(y_i \mid s_i) \tag{13.108}$$

$$= \sum_{l,s} c_{ls} \sum_{i : |y_i| = l,\, s_i = s} \frac{1}{c_{ls}} \log P(y_i \mid s_i) \tag{13.109}$$

$$\le \sum_{l,s} c_{ls} \log \left( \sum_{i : |y_i| = l,\, s_i = s} \frac{1}{c_{ls}} P(y_i \mid s_i) \right), \tag{13.110}$$

where the inequality follows from Jensen’s inequality and the concavity
of the logarithm.

Now since the $y_i$ are distinct, we have
$\sum_{i : |y_i| = l,\, s_i = s} P(y_i \mid s_i) \le 1$. Thus,

$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le \sum_{l,s} c_{ls} \log \frac{1}{c_{ls}}, \tag{13.111}$$

proving the lemma. □
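Ziv’s inequality can be checked numerically in its simplest case. The sketch below takes k = 0, an assumption made to keep the example short: there is then a single (empty) state, $Q_0$ is an i.i.d. distribution, and $c_{ls}$ reduces to the number $c_l$ of phrases of length l. The function names and the test string are illustrative; the string is chosen so that its LZ78 parsing leaves no unfinished tail.

```python
import math
from collections import Counter

def lz78_phrases(s):
    """Distinct phrases of the LZ78 parsing (an unfinished tail is dropped)."""
    seen, current, phrases = set(), "", []
    for ch in s:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases

def ziv_rhs(phrases):
    """-sum_l c_l log2 c_l, with c_l the number of phrases of length l (k = 0)."""
    counts = Counter(len(y) for y in phrases)
    return -sum(c * math.log2(c) for c in counts.values())

def log_prob_iid(s, p_one):
    """log2 of the probability of s under an i.i.d. Bernoulli(p_one) model."""
    ones = s.count("1")
    return ones * math.log2(p_one) + (len(s) - ones) * math.log2(1 - p_one)

x = "01101101110010100"
phrases = lz78_phrases(x)
rhs = ziv_rhs(phrases)
lhs_values = [log_prob_iid(x, p) for p in (0.1, 0.3, 0.5, 9 / 17, 0.9)]
```

The right-hand side depends only on the parsing (here 2 phrases of length 1 and 3 each of lengths 2 and 3, giving about -11.51), yet it dominates the log-probability of the string under every Bernoulli model tried, including the best-fitting one with Pr(1) = 9/17.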


We can now prove the main theorem. 


Theorem 13.5.2 Let $\{X_n\}$ be a binary stationary ergodic process with
entropy rate $H(\mathcal{X})$, and let c(n) be the number of phrases in a
distinct parsing of a sample of length n from this process. Then

$$\limsup_{n \to \infty} \frac{c(n) \log c(n)}{n} \le H(\mathcal{X}) \tag{13.112}$$

with probability 1.

Proof: We begin with Ziv’s inequality, which we rewrite as

$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le -\sum_{l,s} c_{ls} \log \frac{c_{ls}\, c}{c} \tag{13.113}$$

$$= -c \log c - c \sum_{l,s} \frac{c_{ls}}{c} \log \frac{c_{ls}}{c}. \tag{13.114}$$

Writing $\pi_{ls} = \frac{c_{ls}}{c}$, we have

$$\sum_{l,s} \pi_{ls} = 1, \qquad \sum_{l,s} l\, \pi_{ls} = \frac{n}{c}, \tag{13.115}$$

from (13.102) and (13.103). We now define random variables U, V such
that

$$\Pr(U = l, V = s) = \pi_{ls}. \tag{13.116}$$

Thus, $EU = \frac{n}{c}$ and

$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le c H(U, V) - c \log c \tag{13.117}$$

or

$$-\frac{1}{n} \log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \ge \frac{c}{n} \log c - \frac{c}{n} H(U, V). \tag{13.118}$$

Now

$$H(U, V) \le H(U) + H(V) \tag{13.119}$$

and $H(V) \le \log |\mathcal{X}|^k = k$. By Lemma 13.5.4, we have

$$H(U) \le (EU + 1) \log(EU + 1) - (EU) \log(EU) \tag{13.120}$$

$$= \left(\frac{n}{c} + 1\right) \log \left(\frac{n}{c} + 1\right) - \frac{n}{c} \log \frac{n}{c} \tag{13.121}$$

$$= \log \frac{n}{c} + \left(\frac{n}{c} + 1\right) \log \left(\frac{c}{n} + 1\right). \tag{13.122}$$

Thus,

$$\frac{c}{n} H(U, V) \le \frac{c}{n} k + \frac{c}{n} \log \frac{n}{c} + o(1). \tag{13.123}$$

For a given n, the maximum of $\frac{c}{n} \log \frac{n}{c}$ is attained for
the maximum value of c (for $\frac{c}{n} \le \frac{1}{e}$). But from Lemma
13.5.3, $c \le \frac{n}{\log n}(1 + o(1))$. Thus,

$$\frac{c}{n} \log \frac{n}{c} \le O\left(\frac{\log\log n}{\log n}\right), \tag{13.124}$$

and therefore $\frac{c}{n} H(U, V) \to 0$ as $n \to \infty$. Therefore,

$$\frac{c(n) \log c(n)}{n} \le -\frac{1}{n} \log Q_k(x_1, x_2, \ldots, x_n \mid s_1) + \epsilon_k(n), \tag{13.125}$$

where $\epsilon_k(n) \to 0$ as $n \to \infty$. Hence, with probability 1,

$$\limsup_{n \to \infty} \frac{c(n) \log c(n)}{n} \le \lim_{n \to \infty} -\frac{1}{n} \log Q_k(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}) \tag{13.126}$$

$$= H(X_0 \mid X_{-1}, \ldots, X_{-k}) \tag{13.127}$$

$$\to H(\mathcal{X}) \quad \text{as } k \to \infty. \qquad □ \tag{13.128}$$


We now prove that LZ78 coding is asymptotically optimal. 

Theorem 13.5.3 Let $\{X_i\}_{-\infty}^{\infty}$ be a binary stationary
ergodic stochastic process. Let $l(X_1, X_2, \ldots, X_n)$ be the LZ78
codeword length associated with $X_1, X_2, \ldots, X_n$. Then

$$\limsup_{n \to \infty} \frac{1}{n} l(X_1, X_2, \ldots, X_n) \le H(\mathcal{X}) \quad \text{with probability 1}, \tag{13.129}$$

where $H(\mathcal{X})$ is the entropy rate of the process.

Proof: We have shown that
$l(X_1, X_2, \ldots, X_n) = c(n)(\log c(n) + 1)$, where c(n) is the number
of phrases in the LZ78 parsing of the string $X_1, X_2, \ldots, X_n$. By
Lemma 13.5.3, $\limsup c(n)/n = 0$, and thus Theorem 13.5.2 establishes
that

$$\limsup \frac{l(X_1, X_2, \ldots, X_n)}{n} = \limsup \left( \frac{c(n) \log c(n)}{n} + \frac{c(n)}{n} \right) \le H(\mathcal{X}) \quad \text{with probability 1}. \qquad □ \tag{13.130}$$

Thus, the length per source symbol of the LZ78 encoding of an ergodic 
source is asymptotically no greater than the entropy rate of the source. 
There are some interesting features of the proof of the optimality of LZ78 
that are worth noting. The bounds on the number of distinct phrases 
and Ziv’s inequality apply to any distinct parsing of the string, not just 
the incremental parsing version used in the algorithm. The proof can be 
extended in many ways with variations on the parsing algorithm; for 
example, it is possible to use multiple trees that are context or state
dependent [218, 426]. Ziv’s inequality (Lemma 13.5.5) remains particularly
intriguing since it relates a probability on one side with a purely
deterministic function of the parsing of a sequence on the other.

The Lempel-Ziv codes are simple examples of a universal code (i.e., a 
code that does not depend on the distribution of the source). This code can 
be used without knowledge of the source distribution and yet will achieve 
an asymptotic compression equal to the entropy rate of the source. 


SUMMARY


Ideal word length

$$l^*(x) = \log \frac{1}{p(x)}. \tag{13.131}$$

Average description length

$$E_p\, l^*(x) = H(p). \tag{13.132}$$

Estimated probability distribution $\hat{p}(x)$. If $\hat{l}(x) = \log \frac{1}{\hat{p}(x)}$, then

$$E_p\, \hat{l}(x) = H(p) + D(p \,\|\, \hat{p}). \tag{13.133}$$

Average redundancy

$$R_p = E_p\, l(X) - H(p). \tag{13.134}$$

Minimax redundancy. For $X \sim p_\theta(x)$, $\theta \in \Theta$,

$$D^* = \min_{l} \max_{p} R_p = \min_{q} \max_{\theta} D(p_\theta \,\|\, q). \tag{13.135}$$

Minimax theorem. $D^* = C$, where C is the capacity of the channel
$\{\theta, p_\theta(x), \mathcal{X}\}$.

Bernoulli sequences. For $X^n \sim$ Bernoulli($\theta$), the redundancy is

$$D^* = \min_{q} \max_{\theta} D(p_\theta(x^n) \,\|\, q(x^n)) \approx \frac{1}{2} \log n + o(\log n). \tag{13.136}$$

Arithmetic coding. nH bits of $F(x^n)$ reveal approximately n bits
of $x^n$.

Lempel-Ziv coding (recurrence time coding). Let $R_n(X^n)$ be the
last time in the past that we have seen a block of n symbols $X^n$. Then
$\frac{1}{n} \log R_n \to H(X)$, and encoding by describing the recurrence time is
asymptotically optimal.

Lempel-Ziv coding (sequence parsing). If a sequence is parsed into
the shortest phrases not seen before (e.g., 011011101 is parsed to
0,1,10,11,101,...) and $l(x^n)$ is the description length of the parsed
sequence, then

$$\limsup_{n\to\infty} \frac{1}{n}\, l(X^n) \le H(X) \quad \text{with probability 1} \tag{13.137}$$

for every stationary ergodic process $\{X_i\}$.


PROBLEMS 

13.1 Minimax regret data compression and channel capacity. First
consider universal data compression with respect to two source
distributions. Let the alphabet $\mathcal{V} = \{1, e, 0\}$ and let $p_1(v)$ put mass
$1 - \alpha$ on $v = 1$ and mass $\alpha$ on $v = e$. Let $p_2(v)$ put mass $1 - \alpha$ on
$v = 0$ and mass $\alpha$ on $v = e$. We assign word lengths to $\mathcal{V}$ according to
$l(v) = \log \frac{1}{p(v)}$, the ideal codeword length with respect to a
cleverly chosen probability mass function p(v). The worst-case excess
description length (above the entropy of the true distribution) is

$$\max_i \left( E_{p_i} \log \frac{1}{p(V)} - E_{p_i} \log \frac{1}{p_i(V)} \right) = \max_i D(p_i \,\|\, p). \tag{13.138}$$

Thus, the minimax regret is $D^* = \min_p \max_i D(p_i \,\|\, p)$.

(a) Find $D^*$.

(b) Find the p(v) achieving $D^*$.

(c) Compare $D^*$ to the capacity of the binary erasure channel

$$\begin{bmatrix} 1-\alpha & \alpha & 0 \\ 0 & \alpha & 1-\alpha \end{bmatrix}$$

and comment.

13.2 Universal data compression. Consider three possible source
distributions on $\mathcal{X}$,

$P_a = (0.7, 0.2, 0.1)$, $P_b = (0.1, 0.7, 0.2)$, $P_c = (0.2, 0.1, 0.7)$.

(a) Find the minimum incremental cost of compression

$$D^* = \min_{P} \max_{\theta} D(P_\theta \,\|\, P),$$

the associated mass function $P = (p_1, p_2, p_3)$, and ideal
codeword lengths $l_i = \log(1/p_i)$.

(b) What is the channel capacity of a channel matrix with rows
$P_a, P_b, P_c$?

13.3 Arithmetic coding. Let $\{X_i\}_{i=0}^{\infty}$ be a stationary binary Markov
chain with transition matrix

Pij 


(13.139) 


Calculate the first 3 bits of $F(X^\infty) = 0.F_1F_2\ldots$ when $X^\infty =
1010111\ldots$. How many bits of $X^\infty$ does this specify?

13.4 Arithmetic coding. Let $X_i$ be binary stationary Markov with

2 -| 


transition matrix 


2 


(a) Find $F(01110) = \Pr\{.X_1X_2X_3X_4X_5 < .01110\}$.

(b) How many bits $.F_1F_2\ldots$ can be known for sure if it is not
known how X = 01110 continues?

13.5 Lempel-Ziv. Give the LZ78 parsing and encoding of
00000011010100000110101.

13.6 Compression of constant sequence. We are given the constant
sequence $x^n = 11111\ldots$.

(a) Give the LZ78 parsing for this sequence.

(b) Argue that the number of encoding bits per symbol for this
sequence goes to zero as $n \to \infty$.

13.7 Another idealized version of Lempel-Ziv coding. An idealized
version of LZ was shown to be optimal: The encoder and decoder
both have available to them the “infinite past” generated by the
process, $\ldots, X_{-1}, X_0$, and the encoder describes the string
$(X_1, X_2, \ldots, X_n)$ by telling the decoder the position $R_n$ in the past

114 314
314 114
of the first recurrence of that string. This takes roughly $\log R_n +
2 \log \log R_n$ bits. Now consider the following variant: Instead of
describing $R_n$, the encoder describes $R_{n-1}$ plus the last symbol,
$X_n$. From these two the decoder can reconstruct the string
$(X_1, \ldots, X_n)$.

(a) What is the number of bits per symbol used in this case to
encode $(X_1, X_2, \ldots, X_n)$?

(b) Modify the proof given in the text to show that this version is
also asymptotically optimal: namely, that the expected number
of bits per symbol converges to the entropy rate.

13.8 Length of pointers in LZ77. In the version of LZ77 due to Storer
and Szymanski [507] described in Section 13.4.1, a short match
can be represented by either (F, P, L) (flag, pointer, length) or
by (F, C) (flag, character). Assume that the window length is W,
and assume that the maximum match length is M.

(a) How many bits are required to represent P? To represent L?

(b) Assume that C, the representation of a character, is 8 bits 
long. If the representation of P plus L is longer than 8 bits, 
it would be better to represent a single character match as 
an uncompressed character rather than as a match within the 
dictionary. As a function of W and M, what is the shortest 
match that one should represent as a match rather than as 
uncompressed characters? 

(c) Let W = 4096 and M = 256. What is the shortest match that 
one would represent as a match rather than uncompressed 
characters? 

13.9 Lempel-Ziv 78. 

(a) Continue the Lempel-Ziv parsing of the sequence 

0 , 00 , 001 , 00000011010111 . 

(b) Give a sequence for which the number of phrases in the LZ 
parsing grows as fast as possible. 

(c) Give a sequence for which the number of phrases in the LZ 
parsing grows as slowly as possible. 

13.10 Two versions of fixed-database Lempel-Ziv. Consider a source
$(\mathcal{A}, P)$. For simplicity assume that the alphabet is finite, $|\mathcal{A}| =
A < \infty$, and the symbols are i.i.d. $\sim P$. A fixed database $\mathcal{D}$ is
given and is revealed to the decoder. The encoder parses the target
sequence into blocks of length l, and subsequently encodes
them by giving the binary description of their last appearance
in the database. If a match is not found, the entire block is
sent uncompressed, requiring $l \log A$ bits. A flag is used to tell
the decoder whether a match location is being described or the
sequence itself. Parts (a) and (b) give some preliminaries you will
need in showing the optimality of fixed-database LZ in part (c).

(a) Let $x^l$ be a $\delta$-typical sequence of length l starting at 0, and let
$R_l(x^l)$ be the corresponding recurrence index in the infinite
past $\ldots, X_{-2}, X_{-1}$. Show that

$$E[R_l(X^l) \mid X^l = x^l] \le 2^{l(H+\delta)},$$

where H is the entropy rate of the source.

(b) Prove that for any $\epsilon > 0$, $\Pr(R_l(X^l) > 2^{l(H+\epsilon)}) \to 0$ as
$l \to \infty$. (Hint: Expand the probability by conditioning on strings
$x^l$, and break things up into typical and nontypical. Markov’s
inequality and the AEP should prove handy as well.)

(c) Consider the following two fixed databases: (i) $\mathcal{D}_1$ is formed
by taking all $\delta$-typical l-vectors; and (ii) $\mathcal{D}_2$ is formed by taking
the most recent $\tilde{L} = 2^{l(H+\delta)}$ symbols in the infinite past (i.e.,
$X_{-\tilde{L}}, \ldots, X_{-1}$). Argue that the algorithm described above is
asymptotically optimal: namely, that the expected number of
bits per symbol converges to the entropy rate when used in
conjunction with either database $\mathcal{D}_1$ or $\mathcal{D}_2$.

13.11 Tunstall coding. The normal setting for source coding maps a
symbol (or a block of symbols) from a finite alphabet onto a
variable-length string. An example of such a code is the Huffman code, which
is the optimal (minimal expected length) mapping from a set of
symbols to a prefix-free set of codewords. Now consider the dual
problem of variable-to-fixed length codes, where we map a
variable-length sequence of source symbols into a fixed-length binary (or
D-ary) representation. A variable-to-fixed length code for an i.i.d.
sequence of random variables $X_1, X_2, \ldots, X_n$, $X_i \sim p(x)$, $x \in \mathcal{X}
= \{0, 1, \ldots, m-1\}$, is defined by a prefix-free set of phrases $A_D \subset
\mathcal{X}^*$, where $\mathcal{X}^*$ is the set of finite-length strings of symbols of $\mathcal{X}$, and
$|A_D| = D$. Given any sequence $X_1, X_2, \ldots, X_n$, the string is parsed
into phrases from $A_D$ (unique because of the prefix-free property of
$A_D$) and represented by a sequence of symbols from a D-ary
alphabet. Define the efficiency of this coding scheme by

$$R(A_D) = \frac{\log D}{EL(A_D)}, \tag{13.140}$$

where $EL(A_D)$ is the expected length of a phrase from $A_D$.

(a) Prove that $R(A_D) \ge H(X)$.

(b) The process of constructing $A_D$ can be considered as a process
of constructing an m-ary tree whose leaves are the phrases in
$A_D$. Assume that $D = 1 + k(m-1)$ for some integer $k \ge 1$.
Consider the following algorithm due to Tunstall:

(i) Start with $A = \{0, 1, \ldots, m-1\}$ with probabilities $p_0,
p_1, \ldots, p_{m-1}$. This corresponds to a complete m-ary tree
of depth 1.

(ii) Expand the node with the highest probability. For example,
if $p_0$ is the node with the highest probability, the
new set is $A = \{00, 01, \ldots, 0(m-1), 1, \ldots, (m-1)\}$.

(iii) Repeat step (ii) until the number of leaves (number of
phrases) reaches the required value.

Show that the Tunstall algorithm is optimal in the sense that
it constructs a variable-to-fixed code with the best $R(A_D)$
for a given D [i.e., the largest value of $EL(A_D)$ for a given
D].

(c) Show that there exists a D such that $R(A_D^*) < H(X) + 1$.


HISTORICAL NOTES 

The problem of encoding a source with an unknown distribution was 
analyzed by Fitingof [211] and Davisson [159], who showed that there 
were classes of sources for which the universal coding procedure was 
asymptotically optimal. The result relating the average redundancy of a 
universal code and channel capacity is due to Gallager [229] and Ryabko 
[450]. Our proof follows that of Csiszár. This result was extended to
show that the channel capacity was the lower bound for the redundancy 
for “most” sources in the class by Merhav and Feder [387], extending the 
results obtained by Rissanen [444, 448] for the parametric case. 

The arithmetic coding procedure has its roots in the Shannon-Fano
code developed by Elias (unpublished), which was analyzed by Jelinek
[297]. The procedure for the construction of a prefix-free code described
in the text is due to Gilbert and Moore [249]. Arithmetic coding itself was
developed by Rissanen [441] and Pasco [414]; it was generalized by Langdon
and Rissanen [343]. See also the enumerative methods in Cover [120].
Tutorial introductions to arithmetic coding can be found in Langdon [342]
and Witten et al. [564]. Arithmetic coding combined with the context-tree
weighting algorithm due to Willems et al. [560, 561] achieves the Rissanen
lower bound [444] and therefore has the optimal rate of convergence to
the entropy for tree sources with unknown parameters.

The class of Lempel-Ziv algorithms was first described in the seminal
papers of Lempel and Ziv [603, 604]. The original results were theoretically
interesting, but people implementing compression algorithms did not
take notice until the publication of a simple efficient version of the
algorithm due to Welch [554]. Since then, multiple versions of the algorithms
have been described, many of them patented. Versions of this algorithm
are now used in many compression products, including GIF files for image
compression and the CCITT standard for compression in modems. The
optimality of the sliding window version of Lempel-Ziv (LZ77) is due to
Wyner and Ziv [575]. An extension of the proof of the optimality of LZ78
[426] shows that the redundancy of LZ78 is on the order of 1/log(n),
as opposed to the lower bounds of log(n)/n. Thus even though LZ78
is asymptotically optimal for all stationary ergodic sources, it converges
to the entropy rate very slowly compared to the lower bounds for
finite-state Markov sources. However, for the class of all ergodic sources, lower
bounds on the redundancy of a universal code do not exist, as shown by
examples due to Shields [492] and Shields and Weiss [494]. A lossless
block compression algorithm based on sorting the blocks and using simple
run-length encoding due to Burrows and Wheeler [81] has been analyzed
by Effros et al. [181]. Universal methods for prediction are discussed in
Feder, Merhav and Gutman [204, 386, 388].


CHAPTER 14 


KOLMOGOROV COMPLEXITY 


The great mathematician Kolmogorov culminated a lifetime of research
in mathematics, complexity, and information theory with his definition in
1965 of the intrinsic descriptive complexity of an object. In our treatment
so far, the object X has been a random variable drawn according to
a probability mass function p(x). If X is random, there is a sense in
which the descriptive complexity of the event X = x is $\log \frac{1}{p(x)}$, because
$\lceil \log \frac{1}{p(x)} \rceil$ is the number of bits required to describe x by a Shannon code.
One notes immediately that the descriptive complexity of such an object
depends on the probability distribution.

Kolmogorov went further. He defined the algorithmic (descriptive)
complexity of an object to be the length of the shortest binary computer
program that describes the object. (Apparently, a computer, the
most general form of data decompressor, will after a finite amount of
computation, use this description to exhibit the object described.) Thus,
the Kolmogorov complexity of an object dispenses with the probability
distribution. Kolmogorov made the crucial observation that the definition
of complexity is essentially computer independent. It is an amazing fact
that the expected length of the shortest binary computer description of a
random variable is approximately equal to its entropy. Thus, the shortest
computer description acts as a universal code which is uniformly good
for all probability distributions. In this sense, algorithmic complexity is a
conceptual precursor to entropy.

Perhaps a good point of view of the role of this chapter is to consider
Kolmogorov complexity as a way to think. One does not use the shortest
computer program in practice because it may take infinitely long to find
such a minimal program. But one can use very short, not necessarily minimal
programs in practice; and the idea of finding such short programs leads
to universal codes, a good basis for inductive inference, a formalization
of Occam’s razor (“The simplest explanation is best”) and to fundamental
understanding in physics, computer science, and communication theory.


Before formalizing the notion of Kolmogorov complexity, let us give
three strings as examples:

1. 0101010101010101010101010101010101010101010101010101010101010101 

2. 0110101000001001111001100110011111110011101111001100100100001000 

3. 1101111001110101111101101111101110101101111000101110010100111011

What are the shortest binary computer programs for each of these
sequences? The first sequence is definitely simple. It consists of thirty-two
01’s. The second sequence looks random and passes most tests for
randomness, but it is in fact the initial segment of the binary expansion
of $\sqrt{2} - 1$. Again, this is a simple sequence. The third again looks random,
except that the proportion of 1’s is not near $\frac{1}{2}$. We shall assume
that it is otherwise random. It turns out that by describing the number
k of 1’s in the sequence, then giving the index of the sequence in a
lexicographic ordering of those with this number of 1’s, one can give a
description of the sequence in roughly $\log n + nH(\frac{k}{n})$ bits. This again is
substantially fewer than the n bits in the sequence. Again, we conclude
that the sequence, random though it is, is simple. In this case, however, it
is not as simple as the other two sequences, which have constant-length
programs. In fact, its complexity is proportional to n. Finally, we can
imagine a truly random sequence generated by pure coin flips. There are
$2^n$ such sequences and they are all equally probable. It is highly likely
that such a random sequence cannot be compressed (i.e., there is no better
program for such a sequence than simply saying “Print the following:
0101100111010...0”). The reason for this is that there are not enough
short programs to go around. Thus, the descriptive complexity of a truly
random binary sequence is as long as the sequence itself.
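
As a rough numerical illustration of the two-part description for the third string (first the number k of 1's, then an index among all strings with that many 1's), the following Python sketch counts the bits involved. The function name and the rounding of each part up to whole bits are assumptions made for this illustration, and the constant-size program text is ignored.

```python
from math import ceil, comb, log2

def two_part_length(s):
    """Bits to send k (the number of 1's) plus an index among the C(n, k)
    strings of length n with exactly k ones."""
    n = len(s)
    k = s.count("1")
    bits_for_k = ceil(log2(n + 1))            # k ranges over {0, 1, ..., n}
    bits_for_index = ceil(log2(comb(n, k)))   # 0 bits when only one string qualifies
    return bits_for_k + bits_for_index

s3 = "1101111001110101111101101111101110101101111000101110010100111011"
print(len(s3), s3.count("1"), two_part_length(s3))

# The same proportion of 1's at a larger block length.
long_s = "1" * 4300 + "0" * 2100
print(len(long_s), two_part_length(long_s))
```

At n = 64 the saving over the raw string is slight, since the log n overhead for k is comparable to the gain; the saving becomes a fixed fraction of n as the block length grows.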

These are the basic ideas. It will remain to be shown that this notion of 
intrinsic complexity is computer independent (i.e., that the length of the 
shortest program does not depend on the computer). At first, this seems 
like nonsense. But it turns out to be true, up to an additive constant. And 
for long sequences of high complexity, this additive constant (which is 
the length of the preprogram that allows one computer to mimic the other) 
is negligible. 


14.1 MODELS OF COMPUTATION 

To formalize the notions of algorithmic complexity, we first discuss acceptable
models for computers. All but the most trivial computers are universal,
in the sense that they can mimic the actions of other computers.
We touch briefly on a certain canonical universal computer, the universal
Turing machine, the conceptually simplest universal computer.

In 1936, Turing was obsessed with the question of whether the thoughts
in a living brain could be held equally well by a collection of inanimate
parts. In short, could a machine think? By analyzing the human
computational process, he posited some constraints on such a computer.
Apparently, a human thinks, writes, thinks some more, writes, and so on.
Consider a computer as a finite-state machine operating on a finite symbol
set. (The symbols in an infinite symbol set cannot be distinguished in finite
space.) A program tape, on which a binary program is written, is fed left
to right into this finite-state machine. At each unit of time, the machine
inspects the program tape, writes some symbols on a work tape, changes
its state according to its transition table, and calls for more program. The
operations of such a machine can be described by a finite list of transitions.
Turing argued that this machine could mimic the computational
ability of a human being.

After Turing’s work, it turned out that every new computational system
could be reduced to a Turing machine, and conversely. In particular,
the familiar digital computer with its CPU, memory, and input/output
devices could be simulated by and could simulate a Turing machine. This
led Church to state what is now known as Church’s thesis, which states
that all (sufficiently complex) computational models are equivalent in the
sense that they can compute the same family of functions. The class of
functions they can compute agrees with our intuitive notion of effectively
computable functions, that is, functions for which there is a finite prescription
or program that will lead in a finite number of mechanically
specified computational steps to the desired computational result.

We shall have in mind throughout this chapter the computer illustrated
in Figure 14.1. At each step of the computation, the computer reads a
symbol from the input tape, changes state according to its state transition
table, possibly writes something on the work tape or output tape, and
moves the program read head to the next cell of the program read tape.
This machine reads the program from right to left only, never going back,
and therefore the programs form a prefix-free set. No program leading to a
halting computation can be the prefix of another such program. The restriction
to prefix-free programs leads immediately to a theory of Kolmogorov
complexity which is formally analogous to information theory.

FIGURE 14.1. A Turing machine. [Diagram: an input tape ($\ldots p_2 p_1$) feeds
a finite-state machine, which is connected to an output tape ($x_1 x_2 \ldots$)
and a work tape.]

We can view the Turing machine as a map from a set of finite-length
binary strings to the set of finite- or infinite-length binary strings. In
some cases, the computation does not halt, and in such cases the value of
the function is said to be undefined. The set of functions $f : \{0, 1\}^* \to
\{0, 1\}^* \cup \{0, 1\}^\infty$ computable by Turing machines is called the set of
partial recursive functions.


14.2 KOLMOGOROV COMPLEXITY: DEFINITIONS 
AND EXAMPLES 

Let x be a finite-length binary string and let $\mathcal{U}$ be a universal computer.
Let l(x) denote the length of the string x. Let $\mathcal{U}(p)$ denote the output of
the computer $\mathcal{U}$ when presented with a program p.

We define the Kolmogorov (or algorithmic) complexity of a string x
as the minimal description length of x.

Definition The Kolmogorov complexity $K_{\mathcal{U}}(x)$ of a string x with respect
to a universal computer $\mathcal{U}$ is defined as

$$K_{\mathcal{U}}(x) = \min_{p\,:\,\mathcal{U}(p) = x} l(p), \tag{14.1}$$

the minimum length over all programs that print x and halt. Thus, $K_{\mathcal{U}}(x)$
is the shortest description length of x over all descriptions interpreted by
computer $\mathcal{U}$.

A useful technique for thinking about Kolmogorov complexity is the
following: if one person can describe a sequence to another person in
such a manner as to lead unambiguously to a computation of that sequence
in a finite amount of time, the number of bits in that communication is an
upper bound on the Kolmogorov complexity. For example, one can say
“Print out the first 1,239,875,981,825,931 bits of the square root of e.”
Allowing 8 bits per character (ASCII), we see that the unambiguous
73-symbol program above demonstrates that the Kolmogorov complexity of
this huge number is no greater than (8)(73) = 584 bits. Most numbers of
this length (more than a quadrillion bits) have a Kolmogorov complexity
of nearly 1,239,875,981,825,931 bits. The fact that there is a simple
algorithm to calculate the square root of e provides the saving in descriptive
complexity.

In the definition above, we have not mentioned anything about the
length of x. If we assume that the computer already knows the length of
x, we can define the conditional Kolmogorov complexity knowing l(x) as

$$K_{\mathcal{U}}(x \mid l(x)) = \min_{p\,:\,\mathcal{U}(p,\, l(x)) = x} l(p). \tag{14.2}$$

This is the shortest description length if the computer $\mathcal{U}$ has the length of
x made available to it.

It should be noted that $K_{\mathcal{U}}(x \mid y)$ is usually defined as $K_{\mathcal{U}}(x \mid y, y^*)$,
where $y^*$ is the shortest program for y. This is to avoid certain slight
asymmetries, but we will not use this definition here.

We first prove some of the basic properties of Kolmogorov complexity
and then consider various examples.

Theorem 14.2.1 (Universality of Kolmogorov complexity) If $\mathcal{U}$ is a
universal computer, for any other computer $\mathcal{A}$ there exists a constant $c_{\mathcal{A}}$
such that

$$K_{\mathcal{U}}(x) \le K_{\mathcal{A}}(x) + c_{\mathcal{A}} \tag{14.3}$$

for all strings $x \in \{0, 1\}^*$, and the constant $c_{\mathcal{A}}$ does not depend on x.

Proof: Assume that we have a program $p_{\mathcal{A}}$ for computer $\mathcal{A}$ to print x.
Thus, $\mathcal{A}(p_{\mathcal{A}}) = x$. We can precede this program by a simulation program
$s_{\mathcal{A}}$ which tells computer $\mathcal{U}$ how to simulate computer $\mathcal{A}$. Computer $\mathcal{U}$
will then interpret the instructions in the program for $\mathcal{A}$, perform the
corresponding calculations and print out x. The program for $\mathcal{U}$ is $p =
s_{\mathcal{A}}\, p_{\mathcal{A}}$ and its length is

$$l(p) = l(s_{\mathcal{A}}) + l(p_{\mathcal{A}}) = c_{\mathcal{A}} + l(p_{\mathcal{A}}), \tag{14.4}$$

where $c_{\mathcal{A}}$ is the length of the simulation program. Hence,

$$K_{\mathcal{U}}(x) = \min_{p\,:\,\mathcal{U}(p) = x} l(p) \le \min_{p\,:\,\mathcal{A}(p) = x} (l(p) + c_{\mathcal{A}}) = K_{\mathcal{A}}(x) + c_{\mathcal{A}} \tag{14.5}$$

for all strings x. □


The constant $c_{\mathcal{A}}$ in the theorem may be very large. For example, $\mathcal{A}$ may
be a large computer with a large number of functions built into the system.
The computer $\mathcal{U}$ can be a simple microprocessor. The simulation program
will contain the details of the implementation of all these functions, in
fact, all the software available on the large computer. The crucial point is
that the length of this simulation program is independent of the length of
x, the string to be compressed. For sufficiently long x, the length of this
simulation program can be neglected, and we can discuss Kolmogorov
complexity without talking about the constants.

If $\mathcal{A}$ and $\mathcal{U}$ are both universal, we have

$$|K_{\mathcal{U}}(x) - K_{\mathcal{A}}(x)| < c \tag{14.6}$$

for all x. Hence, we will drop all mention of $\mathcal{U}$ in all further definitions. We
will assume that the unspecified computer $\mathcal{U}$ is a fixed universal computer.

Theorem 14.2.2 (Conditional complexity is less than the length of the
sequence)

$$K(x \mid l(x)) \le l(x) + c. \tag{14.7}$$

Proof: A program for printing x is

Print the following l-bit sequence: $x_1 x_2 \ldots x_{l(x)}$.

Note that no bits are required to describe l since l is given. The program
is self-delimiting because l(x) is provided and the end of the program is
thus clearly defined. The length of this program is l(x) + c. □

Without knowledge of the length of the string, we will need an additional
stop symbol or we can use a self-punctuating scheme like the one
described in the proof of the next theorem.

Theorem 14.2.3 (Upper bound on Kolmogorov complexity)

$$K(x) \le K(x \mid l(x)) + 2 \log l(x) + c. \tag{14.8}$$

Proof: If the computer does not know l(x), the method of Theorem
14.2.2 does not apply. We must have some way of informing the computer
when it has come to the end of the string of bits that describes
the sequence. We describe a simple but inefficient method that uses a
sequence 01 as a “comma.”

Suppose that l(x) = n. To describe l(x), repeat every bit of the binary
expansion of n twice; then end the description with a 01 so that the
computer knows that it has come to the end of the description of n.
For example, the number 5 (binary 101) will be described as 11001101.
This description requires $2\lceil \log n \rceil + 2$ bits. Thus, inclusion of the binary
representation of l(x) does not add more than $2 \log l(x) + c$ bits to the
length of the program, and we have the bound in the theorem. □
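
The doubling scheme in this proof is easy to try out. The following Python sketch is an illustration only (the function names are hypothetical): it encodes n by repeating each bit of its binary expansion and appending the 01 comma, and decodes by reading pairs until the comma appears.

```python
def encode_length(n):
    """Repeat every bit of the binary expansion of n, then append the comma 01."""
    bits = format(n, "b")
    return "".join(b + b for b in bits) + "01"

def decode_length(stream):
    """Read doubled bits up to the comma 01; return (n, remaining stream)."""
    bits = []
    i = 0
    while stream[i:i + 2] != "01":
        pair = stream[i:i + 2]
        if pair not in ("00", "11"):
            raise ValueError("malformed length prefix")
        bits.append(pair[0])
        i += 2
    return int("".join(bits), 2), stream[i + 2:]

print(encode_length(5))                      # the example from the proof
print(decode_length(encode_length(5) + "10110"))
```

Because a doubled bit can only be 00 or 11, the pair 01 cannot occur at a pair boundary inside the description of n, so the decoder always knows where the description ends and the payload begins.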

A more efficient method for describing n is to do so recursively. We
first specify the number (log n) of bits in the binary representation of n
and then specify the actual bits of n. To specify log n, the length of the
binary representation of n, we can use the inefficient method (2 log log n)
or the efficient method (log log n + ···). If we use the efficient method at
each level, until we have a small number to specify, we can describe n
in log n + log log n + log log log n + ··· bits, where we continue the sum
until the last positive term. This sum of iterated logarithms is sometimes
written $\log^* n$. Thus, Theorem 14.2.3 can be improved to

$$K(x) \le K(x \mid l(x)) + \log^* l(x) + c. \tag{14.9}$$
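
To get a feel for the size of this sum of iterated logarithms, here is a small Python sketch. The function name is hypothetical, and the sum is computed with real-valued logarithms rather than whole numbers of bits.

```python
from math import log2

def log_star_sum(n):
    """log n + log log n + log log log n + ..., stopping after the last
    positive term (n >= 1)."""
    total = 0.0
    term = log2(n)
    while term > 0:
        total += term
        term = log2(term)
    return total

for n in (16, 65536, 2**20):
    print(n, log_star_sum(n), 2 * log2(n))
```

For n = 16 the terms are 4, 2 and 1, and for n = 65536 they are 16, 4, 2 and 1, so the recursive description costs noticeably less than the 2 log n of the doubling scheme once n is large.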

We now prove that there are very few sequences with low complexity. 

Theorem 14.2.4 (Lower bound on Kolmogorov complexity). The
number of strings x with complexity K(x) < k satisfies

$$|\{x \in \{0, 1\}^* : K(x) < k\}| < 2^k. \tag{14.10}$$

Proof: There are not very many short programs. If we list all the programs
of length < k, we have

$$\underbrace{\Lambda}_{1},\ \underbrace{0, 1}_{2},\ \underbrace{00, 01, 10, 11}_{4},\ \ldots,\ \underbrace{\ldots, \overbrace{11\ldots1}^{k-1}}_{2^{k-1}} \tag{14.11}$$

and the total number of such programs is

$$1 + 2 + 4 + \cdots + 2^{k-1} = 2^k - 1 < 2^k. \tag{14.12}$$

Since each program can produce only one possible output sequence, the
number of sequences with complexity < k is less than $2^k$. □
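
The counting argument can be checked directly. The sketch below uses hypothetical function names; the second function applies the bound to strings of a fixed length n, a standard consequence of the theorem that is stated here only for illustration.

```python
from fractions import Fraction

def num_programs_shorter_than(k):
    """Number of binary strings of length 0, 1, ..., k-1."""
    return sum(2**i for i in range(k))

def max_fraction_compressible(n, d):
    """Upper bound on the fraction of n-bit strings with K(x) < n - d:
    fewer than 2**(n-d) programs are that short, out of 2**n strings."""
    return Fraction(num_programs_shorter_than(n - d), 2**n)

print(num_programs_shorter_than(10))           # 2**10 - 1
print(float(max_fraction_compressible(64, 10)))
```

Under this bound, fewer than one string in a thousand can be shortened by 10 or more bits, whatever the universal computer.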

To avoid confusion and to facilitate exposition in the rest of this chapter,
we shall need to introduce a special notation for the binary entropy function

$$H_0(p) = -p \log p - (1-p) \log(1-p). \tag{14.13}$$

Thus, when we write $H_0\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)$, we will mean $-\bar{X}_n \log \bar{X}_n - (1 -
\bar{X}_n) \log(1 - \bar{X}_n)$ and not the entropy of random variable $\bar{X}_n$. When there
is no confusion, we shall simply write H(p) for $H_0(p)$.

Now let us consider various examples of Kolmogorov complexity. The
complexity will depend on the computer, but only up to an additive constant.
To be specific, we consider a computer that can accept unambiguous
commands in English (with numbers given in binary notation). We will
use the inequality

$$\sqrt{\frac{n}{8k(n-k)}}\; 2^{nH(k/n)} \le \binom{n}{k} \le \sqrt{\frac{n}{\pi k(n-k)}}\; 2^{nH(k/n)}, \qquad k \ne 0, n, \tag{14.14}$$

which is proved in Lemma 17.5.1.
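
Inequality (14.14) can be checked numerically for small n. This Python sketch is only a sanity check, not a proof; the function names are hypothetical, and a tiny relative tolerance is used in the comparison because the bounds are evaluated in floating point.

```python
from math import comb, log2, pi, sqrt

def binary_entropy(p):
    """H_0(p) = -p log p - (1 - p) log(1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def binomial_bounds(n, k):
    """Lower and upper bounds on C(n, k) from (14.14), valid for 0 < k < n."""
    scale = 2.0 ** (n * binary_entropy(k / n))
    lower = sqrt(n / (8 * k * (n - k))) * scale
    upper = sqrt(n / (pi * k * (n - k))) * scale
    return lower, upper

lo, hi = binomial_bounds(20, 5)
print(lo, comb(20, 5), hi)
```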


Example 14.2.1 (A sequence of n zeros) If we assume that the computer
knows n, a short program to print this string is

Print the specified number of zeros.

The length of this program is a constant number of bits. This program
length does not depend on n. Hence, the Kolmogorov complexity of this
sequence is c, and

$$K(000\ldots0 \mid n) = c \quad \text{for all } n. \tag{14.15}$$

Example 14.2.2 (Kolmogorov complexity of $\pi$) The first n bits of $\pi$
can be calculated using a simple series expression. This program has a
small constant length if the computer already knows n. Hence,

$$K(\pi_1 \pi_2 \cdots \pi_n \mid n) = c. \tag{14.16}$$

Example 14.2.3 (Gotham weather) Suppose that we want the computer
to print out the weather in Gotham for n days. We can write a
program that contains the entire sequence $x = x_1 x_2 \cdots x_n$, where $x_i = 1$
indicates rain on day i. But this is inefficient, since the weather is quite
dependent. We can devise various coding schemes for the sequence to
take the dependence into account. A simple one is to find a Markov
model to approximate the sequence (using the empirical transition
probabilities) and then code the sequence using the Shannon code for this
probability distribution. We can describe the empirical Markov transitions
in O(log n) bits and then use $\log \frac{1}{p(x)}$ bits to describe x, where p is the
specified Markov probability. Assuming that the entropy of the weather
is $\frac{1}{5}$ bit per day, we can describe the weather for n days using about n/5
bits, and hence

$$K(\text{Gotham weather} \mid n) \approx \frac{n}{5} + O(\log n) + c. \tag{14.17}$$

Example 14.2.4 (Repeating sequence of the form 01010101...01) A
short program suffices. Simply print the specified number of 01 pairs.
Hence,

$$K(010101010\ldots01 \mid n) = c. \tag{14.18}$$

Example 14.2.5 (Fractal) A fractal is part of the Mandelbrot set and
is generated by a simple computer program. For different points c in the
complex plane, one calculates the number of iterations of the map $z_{n+1} =
z_n^2 + c$ (starting with $z_0 = 0$) needed for $|z|$ to cross a particular threshold.
The point c is then colored according to the number of iterations needed.
Thus, the fractal is an example of an object that looks very complex but
is essentially very simple. Its Kolmogorov complexity is essentially zero.
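
The "simple computer program" of Example 14.2.5 can be sketched as follows. The threshold of 2, the iteration cap of 100, and the function name are assumptions chosen for illustration; the text does not fix them.

```python
def escape_time(c, max_iter=100, threshold=2.0):
    """Number of iterations of z -> z*z + c, starting from z = 0, before |z|
    exceeds the threshold; returns max_iter if it never does within the cap."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > threshold:
            return n
        z = z * z + c
    return max_iter

# A coarse character rendering: '#' marks points that did not escape.
for im in (1.0, 0.5, 0.0, -0.5, -1.0):
    row = ""
    for step in range(-20, 6):
        c = complex(step / 10, im)
        row += "#" if escape_time(c) == 100 else "."
    print(row)
```

The whole picture, at any resolution, is determined by this handful of lines plus the coordinates of the region to draw, which is the sense in which its descriptive complexity is small.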

Example 14.2.6 (Mona Lisa) We can make use of the many structures
and dependencies in the painting. We can probably compress the image
by a factor of 3 or so by using some existing easily described image
compression algorithm. Hence, if n is the number of pixels in the image
of the Mona Lisa,

$$K(\text{Mona Lisa} \mid n) \le \frac{n}{3} + c. \tag{14.19}$$

Example 14.2.7 (Integer n) If the computer knows the number of bits
in the binary representation of the integer, we need only provide the values
of these bits. This program will have length c + log n.

In general, the computer will not know the length of the binary
representation of the integer. So we must inform the computer in some way
when the description ends. Using the method to describe integers used
to derive (14.9), we see that the Kolmogorov complexity of an integer is
bounded by

$$K(n) \le \log^* n + c. \tag{14.20}$$

Example 14.2.8 (Sequence of n bits with k ones) Can we compress a 
sequence of n bits with k ones? 

Our first guess is no, since we have a series of bits that must be repro¬ 
duced exactly. But consider the following program: 
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Generate, in lexicographic order, all sequences with k ones; 
Of these sequences, print the /th sequence. 


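This program will print out the required sequence. The only variables in
the program are $k$ (with known range $\{0, 1, \ldots, n\}$) and $i$ (with conditional
range $\{1, 2, \ldots, \binom{n}{k}\}$). The total length of this program is

\begin{align}
l(p) &= c + \underbrace{\log n}_{\text{to express } k} + \underbrace{\log \binom{n}{k}}_{\text{to express } i} \tag{14.21} \\
&\le c' + \log n + nH_0\left(\frac{k}{n}\right) - \frac{1}{2} \log n, \tag{14.22}
\end{align}

since $\binom{n}{k} \le \frac{1}{\sqrt{\pi n p q}} 2^{nH_0(p)}$ by (14.14) for $p = k/n$ and $q = 1 - p$ and $k \ne 0$
and $k \ne n$. We have used $\log n$ bits to represent $k$. Thus, if $\sum_{i=1}^{n} x_i = k$,
then

$$K(x_1, x_2, \ldots, x_n \mid n) \le nH_0\left(\frac{k}{n}\right) + \frac{1}{2} \log n + c. \tag{14.23}$$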

We can summarize Example 14.2.8 in the following theorem.

Theorem 14.2.5 The Kolmogorov complexity of a binary string $x$ is
bounded by

$$K(x_1 x_2 \ldots x_n \mid n) \le nH_0\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) + \frac{1}{2} \log n + c. \tag{14.24}$$

Proof: Use the program described in Example 14.2.8.

□
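
The two-part description of Example 14.2.8 can be sketched without generating every sequence. The code below computes the position of a string among all strings with the same length and number of ones, in lexicographic order, and inverts that map. The names `rank`, `unrank`, and `h0` are assumptions, and the index here starts at 0 rather than 1 as in the text.

```python
from math import comb, log2


def h0(p: float) -> float:
    """Binary entropy function H_0(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def rank(bits: str) -> int:
    """Index (from 0) of bits among equal-length strings with the same
    number of ones, taken in lexicographic order."""
    n, k, index = len(bits), bits.count("1"), 0
    for pos, b in enumerate(bits):
        if b == "1":
            # Every string that agrees so far but has '0' here comes earlier.
            index += comb(n - pos - 1, k)
            k -= 1
    return index


def unrank(n: int, k: int, index: int) -> str:
    """Inverse of rank: rebuild the string from (n, k, index)."""
    out = []
    for pos in range(n):
        zeros_first = comb(n - pos - 1, k)  # strings with '0' at this position
        if index < zeros_first:
            out.append("0")
        else:
            out.append("1")
            index -= zeros_first
            k -= 1
    return "".join(out)
```

Describing a string then takes about $\log n$ bits for $k$ plus $\log \binom{n}{k}$ bits for the index, and the second term never exceeds $nH_0(k/n)$.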


Remark Let $x \in \{0,1\}^*$ be the data that we wish to compress, and
consider the program $p$ to be the compressed data. We will have succeeded
in compressing the data only if $l(p) < l(x)$, or

$$K(x) < l(x). \tag{14.25}$$

In general, when the length $l(x)$ of the sequence $x$ is small, the constants
that appear in the expressions for the Kolmogorov complexity will overwhelm
the contributions due to $l(x)$. Hence, the theory is useful primarily
when $l(x)$ is very large. In such cases we can safely neglect the terms
that do not depend on $l(x)$.

14.3 KOLMOGOROV COMPLEXITY AND ENTROPY

We now consider the relationship between the Kolmogorov complexity of 
a sequence of random variables and its entropy. In general, we show that 
the expected value of the Kolmogorov complexity of a random sequence 
is close to the Shannon entropy. First, we prove that the program lengths 
satisfy the Kraft inequality. 

Lemma 14.3.1 For any computer $U$,

$$\sum_{p :\, U(p) \text{ halts}} 2^{-l(p)} \le 1. \tag{14.26}$$

Proof: If the computer halts on any program, it does not look any further 
for input. Hence, there cannot be any other halting program with this 
program as a prefix. Thus, the halting programs form a prefix-free set, 
and their lengths satisfy the Kraft inequality (Theorem 5.2.1). 
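
The lemma can be checked on a toy example. The sketch below tests whether a finite set of binary strings (standing in for the halting programs of some hypothetical machine) is prefix-free and computes its Kraft sum exactly. The function names and the example sets are assumptions for illustration.

```python
from fractions import Fraction
from typing import Iterable


def is_prefix_free(words: Iterable[str]) -> bool:
    """True if no word is a prefix of (or equal to) another word."""
    ordered = sorted(words)
    # After sorting, any prefix relation shows up between adjacent entries.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def kraft_sum(words: Iterable[str]) -> Fraction:
    """Sum of 2^(-length) over the given words, as an exact fraction."""
    return sum((Fraction(1, 2 ** len(w)) for w in words), Fraction(0))


if __name__ == "__main__":
    halting = ["0", "10", "110"]          # a prefix-free toy set
    print(is_prefix_free(halting), kraft_sum(halting))
```

A set that is not prefix-free, such as 0, 1, 00, can have a Kraft sum above 1, which is why the prefix-free property of halting programs matters here.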

We now show that $\frac{1}{n} E K(X^n \mid n) \approx H(X)$ for i.i.d. processes with a
finite alphabet.

Theorem 14.3.1 (Relationship of Kolmogorov complexity and entropy)
Let the stochastic process $\{X_i\}$ be drawn i.i.d. according to the probability
mass function $f(x)$, $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite alphabet. Let $f(x^n) =
\prod_{i=1}^{n} f(x_i)$. Then there exists a constant $c$ such that

$$H(X) \le \frac{1}{n} \sum_{x^n} f(x^n) K(x^n \mid n) \le H(X) + \frac{(|\mathcal{X}| - 1) \log n}{n} + \frac{c}{n} \tag{14.27}$$

for all $n$. Consequently,

$$E \frac{1}{n} K(X^n \mid n) \to H(X). \tag{14.28}$$

Proof: Consider the lower bound. The allowed programs satisfy the prefix
property, and thus their lengths satisfy the Kraft inequality. We assign
to each $x^n$ the length of the shortest program $p$ such that $U(p, n) = x^n$.
These shortest programs also satisfy the Kraft inequality. We know from
the theory of source coding that the expected codeword length must be
greater than or equal to the entropy. Hence,

$$\sum_{x^n} f(x^n) K(x^n \mid n) \ge H(X_1, X_2, \ldots, X_n) = nH(X). \tag{14.29}$$
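
We first prove the upper bound when $X$ is binary (i.e., $X_1, X_2, \ldots, X_n$
are i.i.d. $\sim$ Bernoulli($\theta$)). Using the method of Theorem 14.2.5, we can
bound the complexity of a binary string by

$$K(x_1 x_2 \ldots x_n \mid n) \le nH_0\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) + \frac{1}{2} \log n + c. \tag{14.30}$$

Hence,

\begin{align}
E K(X_1 X_2 \ldots X_n \mid n) &\le n E H_0\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) + \frac{1}{2} \log n + c \tag{14.31} \\
&\overset{(a)}{\le} nH_0\left(\frac{1}{n} \sum_{i=1}^{n} E X_i\right) + \frac{1}{2} \log n + c \tag{14.32} \\
&= nH_0(\theta) + \frac{1}{2} \log n + c, \tag{14.33}
\end{align}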


where (a) follows from Jensen’s inequality and the concavity of the 
entropy. Thus, we have proved the upper bound in the theorem for binary 
processes. 

We can use the same technique for the case of a nonbinary finite alphabet.
We first describe the type of the sequence (the empirical frequency
of occurrence of each of the alphabet symbols as defined in Section 11.1)
using $(|\mathcal{X}| - 1) \log n$ bits (the frequency of the last symbol can be calculated
from the frequencies of the rest). Then we describe the index
of the sequence within the set of all sequences having the same type.
The type class has less than $2^{nH(P_{x^n})}$ elements (where $P_{x^n}$ is the type
of the sequence $x^n$) as shown in Chapter 11, and therefore the two-stage
description of a string $x^n$ has length

$$K(x^n \mid n) \le nH(P_{x^n}) + (|\mathcal{X}| - 1) \log n + c. \tag{14.34}$$

Again, taking the expectation and applying Jensen’s inequality as in the
binary case, we obtain

$$E K(X^n \mid n) \le nH(X) + (|\mathcal{X}| - 1) \log n + c. \tag{14.35}$$

Dividing this by n yields the upper bound of the theorem. □ 
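
The Jensen step in the binary case can be checked numerically. The sketch below computes $E\,H_0(\bar{X}_n)$ exactly from the binomial distribution and compares it with $H_0(\theta)$. The function names are assumptions, and the constant `c` in the bound is an arbitrary placeholder, since the true constant depends on the computer.

```python
from math import comb, log2


def h0(p: float) -> float:
    """Binary entropy function H_0(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def expected_empirical_entropy(n: int, theta: float) -> float:
    """E[H_0(k/n)] where k ~ Binomial(n, theta), computed by direct summation."""
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k) * h0(k / n)
        for k in range(n + 1)
    )


def upper_bound_per_symbol(n: int, theta: float, c: float = 0.0) -> float:
    """H_0(theta) + ((1/2) log n + c) / n; c is a placeholder constant."""
    return h0(theta) + (0.5 * log2(n) + c) / n


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, expected_empirical_entropy(n, 0.3), h0(0.3),
              upper_bound_per_symbol(n, 0.3))
```

Since $H_0$ is concave, the expected empirical entropy always sits at or below $H_0(\theta)$, and the gap as well as the $\frac{1}{2}\log n / n$ overhead shrink as $n$ grows.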

Removing the conditioning on the length of the sequence is straightforward.
By similar arguments, we can show that

$$H(X) \le \frac{1}{n} \sum_{x^n} f(x^n) K(x^n) \le H(X) + \frac{(|\mathcal{X}| + 1) \log n}{n} + \frac{c}{n} \tag{14.36}$$

for all $n$. The lower bound follows from the fact that $K(x^n)$ is also a
prefix-free code for the source, and the upper bound can be derived from
the fact that $K(x^n) \le K(x^n \mid n) + 2 \log n + c$. Thus,

$$E \frac{1}{n} K(X^n) \to H(X), \tag{14.37}$$

and the compressibility achieved by the computer goes to the entropy 
limit. 


14.4 KOLMOGOROV COMPLEXITY OF INTEGERS 

In Section 14.2 we defined the Kolmogorov complexity of a binary string
as the length of the shortest program for a universal computer that prints 
out that string. We can extend that definition to define the Kolmogorov 
complexity of an integer to be the Kolmogorov complexity of the corre¬ 
sponding binary string. 

Definition The Kolmogorov complexity of an integer n is defined as 

$$K(n) = \min_{p :\, U(p) = n} l(p). \tag{14.38}$$

The properties of the Kolmogorov complexity of integers are very sim¬ 
ilar to those of the Kolmogorov complexity of bit strings. The following 
properties are immediate consequences of the corresponding properties 
for strings. 

Theorem 14.4.1 For universal computers A and U, 

$$K_U(n) \le K_A(n) + c_A. \tag{14.39}$$

Also, since any number can be specified by its binary expansion, we 
have the following theorem. 


Theorem 14.4.2 


$$K(n) \le \log^* n + c. \tag{14.40}$$

Theorem 14.4.3 There are an infinite number of integers $n$ such that
$K(n) > \log n$.

Proof: We know from Lemma 14.3.1 that

$$\sum_n 2^{-K(n)} \le 1 \tag{14.41}$$

and

$$\sum_n 2^{-\log n} = \sum_n \frac{1}{n} = \infty. \tag{14.42}$$

But if $K(n) < \log n$ for all $n > n_0$, then

$$\sum_{n = n_0}^{\infty} 2^{-K(n)} > \sum_{n = n_0}^{\infty} 2^{-\log n} = \infty, \tag{14.43}$$

which is a contradiction. □


14.5 ALGORITHMICALLY RANDOM AND INCOMPRESSIBLE 
SEQUENCES 

From the examples in Section 14.2, it is clear that there are some long 
sequences that are simple to describe, like the first million bits of $\pi$. By
the same token, there are also large integers that are simple to describe, 
such as 

or (100!)!. 

We now show that although there are some simple sequences, most 
sequences do not have simple descriptions. Similarly, most integers are 
not simple. Hence, if we draw a sequence at random, we are likely to 
draw a complex sequence. The next theorem shows that the probability 
that a sequence can be compressed by more than $k$ bits is no greater than
$2^{-k}$.

Theorem 14.5.1 Let $X_1, X_2, \ldots, X_n$ be drawn according to a Bernoulli($\frac{1}{2}$)
process. Then

$$P(K(X_1 X_2 \ldots X_n \mid n) < n - k) < 2^{-k}. \tag{14.44}$$


Proof:

\begin{align}
P(K(X_1 X_2 \ldots X_n \mid n) < n - k)
&= \sum_{x_1 x_2 \ldots x_n :\, K(x_1 x_2 \ldots x_n \mid n) < n - k} p(x_1, x_2, \ldots, x_n) \tag{14.45} \\
&= \sum_{x_1 x_2 \ldots x_n :\, K(x_1 x_2 \ldots x_n \mid n) < n - k} 2^{-n} \tag{14.46} \\
&= |\{x_1 x_2 \ldots x_n : K(x_1 x_2 \ldots x_n \mid n) < n - k\}| \, 2^{-n} \notag \\
&< 2^{n-k} 2^{-n} \quad \text{(by Theorem 14.2.4)} \tag{14.47} \\
&= 2^{-k}. \tag{14.48}
\end{align}

□
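
Kolmogorov complexity itself cannot be computed, but the flavor of this result can be seen with an ordinary compressor. In the sketch below, the length of a zlib encoding is used as a rough stand-in: together with a fixed decompressor it gives only an upper bound on complexity, never the true value. The seed, the sample size, and the helper name `compressed_length` are assumptions.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib encoding of data (an upper-bound proxy only)."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 4096
random_bytes = bytes(rng.getrandbits(8) for _ in range(n))   # coin-flip data
patterned_bytes = b"01" * (n // 2)                           # highly regular data

if __name__ == "__main__":
    print("random   :", n, "->", compressed_length(random_bytes))
    print("patterned:", n, "->", compressed_length(patterned_bytes))
```

The patterned input shrinks to a small fraction of its length, while the coin-flip input does not shrink in any meaningful way, in line with the counting argument above.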

Thus, most sequences have a complexity close to their length. For
example, the fraction of sequences of length $n$ that have complexity less
than $n - 5$ is less than 1/32. This motivates the following definition:

Definition A sequence $x_1, x_2, \ldots, x_n$ is said to be algorithmically random
if

$$K(x_1 x_2 \ldots x_n \mid n) \ge n. \tag{14.49}$$

Note that by the counting argument, there exists, for each $n$, at least
one sequence $x^n$ such that

$$K(x^n \mid n) \ge n. \tag{14.50}$$

Definition We call an infinite string $x$ incompressible if

$$\lim_{n \to \infty} \frac{K(x_1 x_2 x_3 \cdots x_n \mid n)}{n} = 1. \tag{14.51}$$


Theorem 14.5.2 (Strong law of large numbers for incompressible sequences)
If a string $x_1 x_2 \ldots$ is incompressible, it satisfies the law of large
numbers in the sense that

$$\frac{1}{n} \sum_{i=1}^{n} x_i \to \frac{1}{2}. \tag{14.52}$$

Hence the proportions of 0’s and 1’s in any incompressible string are
almost equal.

Proof: Let $\theta_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ denote the proportion of 1’s in $x_1 x_2 \ldots
x_n$. Then using the method of Example 14.2.8, one can write a program
of length $nH_0(\theta_n) + 2 \log(n\theta_n) + c$ to print $x^n$. Thus,

$$\frac{K(x^n \mid n)}{n} < H_0(\theta_n) + 2 \frac{\log n}{n} + \frac{c'}{n}. \tag{14.53}$$

By the incompressibility assumption, we also have the lower bound for
large enough $n$,

$$1 - \epsilon \le \frac{K(x^n \mid n)}{n} \le H_0(\theta_n) + 2 \frac{\log n}{n} + \frac{c'}{n}. \tag{14.54}$$

Thus,

$$H_0(\theta_n) > 1 - \frac{2 \log n + c'}{n} - \epsilon. \tag{14.55}$$

Inspection of the graph of $H_0(p)$ (Figure 14.2) shows that $\theta_n$ is close to
$\frac{1}{2}$ for large $n$. Specifically, the inequality above implies that

$$\theta_n \in \left(\frac{1}{2} - \delta_n, \frac{1}{2} + \delta_n\right), \tag{14.56}$$

FIGURE 14.2. $H_0(p)$ vs. $p$.

where $\delta_n$ is chosen so that

$$H_0\left(\frac{1}{2} - \delta_n\right) = 1 - \frac{2 \log n + \epsilon n + c'}{n}, \tag{14.57}$$

which implies that $\delta_n \to 0$ as $n \to \infty$. Thus, $\frac{1}{n} \sum x_i \to \frac{1}{2}$ as $n \to \infty$. □


We have now proved that incompressible sequences look random in the
sense that the proportions of 0’s and 1’s are almost equal. In general, we
can show that if a sequence is incompressible, it will satisfy all computable
statistical tests for randomness. (Otherwise, identification of the test that $x$
fails will reduce the descriptive complexity of $x$, yielding a contradiction.)
In this sense, the algorithmic test for randomness is the ultimate test,
including within it all other computable tests for randomness.

We now prove a related law of large numbers for the Kolmogorov
complexity of Bernoulli($\theta$) sequences. The Kolmogorov complexity of
a sequence of binary random variables drawn i.i.d. according to a
Bernoulli($\theta$) process is close to the entropy $H_0(\theta)$. In Theorem 14.3.1
we proved that the expected value of the Kolmogorov complexity
of a random Bernoulli sequence converges to the entropy [i.e.,
$E \frac{1}{n} K(X_1 X_2 \ldots X_n \mid n) \to H_0(\theta)$]. Now we remove the expectation.

Theorem 14.5.3 Let $X_1, X_2, \ldots, X_n$ be drawn i.i.d. $\sim$ Bernoulli($\theta$).
Then

$$\frac{1}{n} K(X_1 X_2 \ldots X_n \mid n) \to H_0(\theta) \quad \text{in probability.} \tag{14.58}$$

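Proof: Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the proportion of 1’s in $X_1, X_2, \ldots, X_n$.
Then using the method described in (14.23), we have

$$K(X_1 X_2 \ldots X_n \mid n) < nH_0(\bar{X}_n) + 2 \log n + c, \tag{14.59}$$

and since by the weak law of large numbers, $\bar{X}_n \to \theta$ in probability, we
have

$$\Pr\left\{ \frac{1}{n} K(X_1 X_2 \ldots X_n \mid n) - H_0(\theta) \ge \epsilon \right\} \to 0. \tag{14.60}$$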

Conversely, we can bound the number of sequences with complexity significantly
lower than the entropy. From the AEP, we can divide the set
of sequences into the typical set and the nontypical set. There are at
least $(1 - \epsilon) 2^{n(H_0(\theta) - \epsilon)}$ sequences in the typical set. At most $2^{n(H_0(\theta) - c)}$
of these typical sequences can have a complexity less than $n(H_0(\theta) - c)$.
The probability that the complexity of the random sequence is less than
$n(H_0(\theta) - c)$ is

\begin{align}
\Pr(K(X^n \mid n) < n(H_0(\theta) - c))
&\le \Pr(X^n \notin A_\epsilon^{(n)}) + \Pr(X^n \in A_\epsilon^{(n)},\ K(X^n \mid n) < n(H_0(\theta) - c)) \notag \\
&\le \epsilon + \sum_{x^n \in A_\epsilon^{(n)},\ K(x^n \mid n) < n(H_0(\theta) - c)} p(x^n) \tag{14.61} \\
&\le \epsilon + \sum_{x^n \in A_\epsilon^{(n)},\ K(x^n \mid n) < n(H_0(\theta) - c)} 2^{-n(H_0(\theta) - \epsilon)} \tag{14.62} \\
&\le \epsilon + 2^{n(H_0(\theta) - c)} 2^{-n(H_0(\theta) - \epsilon)} \tag{14.63} \\
&= \epsilon + 2^{-n(c - \epsilon)}, \tag{14.64}
\end{align}

which is arbitrarily small for appropriate choice of $\epsilon$, $n$, and $c$. Hence with
high probability, the Kolmogorov complexity of the random sequence is
close to the entropy, and we have

$$\frac{K(X_1, X_2, \ldots, X_n \mid n)}{n} \to H_0(\theta) \quad \text{in probability.} \tag{14.65}$$

□
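
The upper half of this argument is easy to watch numerically. The sketch below draws Bernoulli($\theta$) samples and evaluates the right-hand side of (14.59) per symbol, which should settle near $H_0(\theta)$ as $n$ grows. The seed, the value $\theta = 0.2$, and the placeholder constant `c` are assumptions; the true constant depends on the computer.

```python
import random
from math import log2
from typing import Sequence


def h0(p: float) -> float:
    """Binary entropy function H_0(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def complexity_upper_bound(bits: Sequence[int], c: float = 0.0) -> float:
    """n * H_0(sample mean) + 2 * log2(n) + c, with c a placeholder constant."""
    n = len(bits)
    return n * h0(sum(bits) / n) + 2 * log2(n) + c


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, chosen arbitrarily
    theta = 0.2
    for n in (100, 1_000, 10_000):
        sample = [1 if rng.random() < theta else 0 for _ in range(n)]
        print(n, complexity_upper_bound(sample) / n, h0(theta))
```

This only illustrates the upper bound; the matching lower bound comes from the counting argument over the typical set and has no comparable computation.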

14.6 UNIVERSAL PROBABILITY 

Suppose that a computer is fed a random program. Imagine a monkey
sitting at a keyboard and typing the keys at random. Equivalently, feed a
series of fair coin flips into a universal Turing machine. In either case, most
strings will not make sense to the computer. If a person sits at a terminal
and types keys at random, she will probably get an error message (i.e., the
computer will print the null string and halt). But with a certain probability
she will hit on something that makes sense. The computer will then print
out something meaningful. Will this output sequence look random?

From our earlier discussions, it is clear that most sequences of length $n$
have complexity close to $n$. Since the probability of an input program $p$
is $2^{-l(p)}$, shorter programs are much more probable than longer ones; and
when they produce long strings, shorter programs do not produce random
strings; they produce strings with simply described structure.

The probability distribution on the output strings is far from uniform. 
Under the computer-induced distribution, simple strings are more likely 
than complicated strings of the same length. This motivates us to define
a universal probability distribution on strings as follows:

Definition The universal probability of a string $x$ is

$$P_U(x) = \sum_{p :\, U(p) = x} 2^{-l(p)} = \Pr(U(p) = x), \tag{14.66}$$

which is the probability that a program randomly drawn as a sequence of
fair coin flips $p_1, p_2, \ldots$ will print out the string $x$.

This probability is universal in many senses. We can consider it as the 
probability of observing such a string in nature; the implicit belief is that 
simpler strings are more likely than complicated strings. For example, if 
we wish to describe the laws of physics, we might consider the simplest 
string describing the laws as the most likely. This principle, known as 
Occam’s Razor, has been a general principle guiding scientific research 
for centuries: If there are many explanations consistent with the observed 
data, choose the simplest. In our framework, Occam’s Razor is equivalent 
to choosing the shortest program that produces a given string. 

This probability mass function is called universal because of the fol¬ 
lowing theorem. 

Theorem 14.6.1 For every computer $A$,

$$P_U(x) \ge c'_A P_A(x) \tag{14.67}$$

for every string $x \in \{0, 1\}^*$, where the constant $c'_A$ depends only on $U$ and
$A$.

Proof: From the discussion of Section 14.2, we recall that for every
program $p'$ for $A$ that prints $x$, there exists a program $p$ for $U$ of length
not more than $l(p') + c_A$ produced by prefixing a simulation program for
$A$. Hence,

$$P_U(x) = \sum_{p :\, U(p) = x} 2^{-l(p)} \ge \sum_{p' :\, A(p') = x} 2^{-l(p') - c_A} = c'_A P_A(x). \tag{14.68}$$

□


Any sequence drawn according to a computable probability mass func¬ 
tion on binary strings can be considered to be produced by some computer 
A acting on a random input (via the probability inverse transformation 
acting on a random input). Hence, the universal probability distribution 
includes a mixture of all computable probability distributions. 




Remark (Bounded likelihood ratio) In particular, Theorem 14.6.1 guarantees
that a likelihood ratio test of the hypothesis that $X$ is drawn
according to $P_U$ versus the hypothesis that it is drawn according to
$P_A$ will have bounded likelihood ratio. If $U$ and $A$ are universal, the
ratio $P_U(x)/P_A(x)$ is bounded away from 0 and infinity for all $x$. This is in
contrast to other simple hypothesis-testing problems (like Bernoulli($\theta_1$)
versus Bernoulli($\theta_2$)), where the likelihood ratio goes to 0 or $\infty$ as the
sample size goes to infinity. Apparently, $P_U$, which is a mixture of all
computable distributions, can never be rejected completely as the true
distribution of any data drawn according to some computable probability
distribution. In that sense we cannot reject the possibility that the universe
is the output of monkeys typing at a computer. However, we can reject
the hypothesis that the universe is random (monkeys with no computer).

In Section 14.11 we prove that

$$P_U(x) \approx 2^{-K(x)}, \tag{14.69}$$

thus showing that $K(x)$ and $\log \frac{1}{P_U(x)}$ have equal status as universal
algorithmic complexity measures. This is especially interesting since $\log \frac{1}{P_U(x)}$
is the ideal codeword length (the Shannon codeword length) with respect
to the universal probability distribution $P_U(x)$.

We conclude this section with an example of a monkey at a typewriter 
vs. a monkey at a computer keyboard. If the monkey types at random 
on a typewriter, the probability that it types out all the works of Shakespeare
(assuming that the text is 1 million bits long) is $2^{-1,000,000}$. If the
monkey sits at a computer terminal, however, the probability that it types
out Shakespeare is now $2^{-K(\text{Shakespeare})} \approx 2^{-250,000}$, which although
extremely small is still exponentially more likely than when the monkey 
sits at a dumb typewriter. 

The example indicates that a random input to a computer is much more 
likely to produce “interesting” outputs than a random input to a typewriter. 
We all know that a computer is an intelligence amplifier. Apparently, it 
creates sense from nonsense as well. 


14.7 THE HALTING PROBLEM AND THE NONCOMPUTABILITY 
OF KOLMOGOROV COMPLEXITY 

Consider the following paradoxical statement: 

This statement is false. 

This paradox is sometimes stated in a two-statement form: 

The next statement is false.

The preceding statement is true. 

These paradoxes are versions of what is called the Epimenides liar paradox,
and they illustrate the pitfalls involved in self-reference. In 1931, Gödel
used this idea of self-reference to show that any interesting system of
mathematics is not complete; there are statements in the system that are
true but that cannot be proved within the system. To accomplish this, he
translated theorems and proofs into integers and constructed a statement
of the above form, which can therefore not be proved true or false.

The halting problem in computer science is very closely connected 
with Gödel’s incompleteness theorem. In essence, it states that for any
computational model, there is no general algorithm to decide whether a 
program will halt or not (go on forever). Note that it is not a statement 
about any specific program. Quite clearly, there are many programs that 
can easily be shown to halt or go on forever. The halting problem says 
that we cannot answer this question for all programs. The reason for this 
is again the idea of self-reference. 

To a practical person, the halting problem may not be of any immediate 
significance, but it has great theoretical importance as the dividing line 
between things that can be done on a computer (given unbounded memory 
and time) and things that cannot be done at all (such as proving all true 
statements in number theory). Gödel’s incompleteness theorem is one of
the most important mathematical results of the twentieth century, and its
consequences are still being explored. The halting problem is an essential
example of Gödel’s incompleteness theorem.

One of the consequences of the nonexistence of an algorithm for the 
halting problem is the noncomputability of Kolmogorov complexity. The 
only way to find the shortest program in general is to try all short programs 
and see which of them can do the job. However, at any time some of 
the short programs may not have halted and there is no effective (finite 
mechanical) way to tell whether or not they will halt and what they will 
print out. Hence, there is no effective way to find the shortest program to 
print a given string. 

The noncomputability of Kolmogorov complexity is an example of the
Berry paradox. The Berry paradox asks for the smallest number not nameable
in under 10 words. A number like 1,101,121 cannot be a solution
since the defining expression itself is less than 10 words long. This illustrates
the problems with the terms nameable and describable; they are
too powerful to be used without a strict meaning. If we restrict ourselves
to the meaning “can be described for printing out on a computer,” we can
resolve Berry’s paradox by saying that the smallest number not describable
in less than 10 words exists but is not computable. This “description” is
not a program for computing the number. E. F. Beckenbach pointed out
a similar problem in the classification of numbers as dull or interesting;
the smallest dull number must be interesting.

As stated at the beginning of the chapter, one does not really anticipate 
that practitioners will find the shortest computer program for a given 
string. The shortest program is not computable, although as more and more 
programs are shown to produce the string, the estimates from above of 
the Kolmogorov complexity converge to the true Kolmogorov complexity. 
(The problem, of course, is that one may have found the shortest program 
and never know that no shorter program exists.) Even though Kolmogorov 
complexity is not computable, it provides a framework within which to 
consider questions of randomness and inference. 

14.8 Ω

In this section we introduce Chaitin’s mystical, magical number, $\Omega$, which
has some extremely interesting properties.

Definition

$$\Omega = \sum_{p :\, U(p) \text{ halts}} 2^{-l(p)}. \tag{14.70}$$

Note that $\Omega = \Pr(U(p) \text{ halts})$, the probability that the given universal
computer halts when the input to the computer is a binary string drawn
according to a Bernoulli($\frac{1}{2}$) process.

Since the programs that halt are prefix-free, their lengths satisfy the
Kraft inequality, and hence the sum above is always between 0 and 1. Let
$\Omega_n = .\omega_1 \omega_2 \cdots \omega_n$ denote the first $n$ bits of $\Omega$. The properties of $\Omega$ are
as follows:

1. $\Omega$ is noncomputable. There is no effective (finite, mechanical) way
to check whether arbitrary programs halt (the halting problem), so
there is no effective way to compute $\Omega$.

2. $\Omega$ is a “philosopher’s stone”. Knowing $\Omega$ to an accuracy of $n$
bits will enable us to decide the truth of any provable or finitely
refutable mathematical theorem that can be written in less than $n$
bits. Actually, all that this means is that given $n$ bits of $\Omega$, there
is an effective procedure to decide the truth of $n$-bit theorems; the
procedure may take an arbitrarily long (but finite) time. Of course,
without knowing $\Omega$, it is not possible to check the truth or falsity of
every theorem by an effective procedure (Gödel’s incompleteness
theorem).

The basic idea of the procedure using $n$ bits of $\Omega$ is simple: We
run all programs until the sum of the masses $2^{-l(p)}$ contributed
by programs that halt equals or exceeds $\Omega_n = 0.\omega_1 \omega_2 \cdots \omega_n$, the
truncated version of $\Omega$ that we are given. Then, since

$$\Omega - \Omega_n < 2^{-n}, \tag{14.71}$$

we know that the sum of all further contributions of the form $2^{-l(p)}$
to $\Omega$ from programs that halt must also be less than $2^{-n}$. This implies
that no program of length $\le n$ that has not yet halted will ever halt,
which enables us to decide the halting or nonhalting of all programs
of length $\le n$.

To complete the proof, we must show that it is possible for a computer
to run all possible programs in “parallel” in such a way that
any program that halts will eventually be found to halt. First, list all
possible programs, starting with the null program, $\Lambda$:

$$\Lambda, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, \ldots. \tag{14.72}$$

Then let the computer execute one clock cycle of $\Lambda$ for the first
cycle. In the next cycle, let the computer execute two clock cycles
of $\Lambda$ and two clock cycles of the program 0. In the third cycle, let
it execute three clock cycles of each of the first three programs, and
so on. In this way, the computer will eventually run all possible
programs and run them for longer and longer times, so that if any
program halts, it will eventually be discovered to halt. The computer
keeps track of which program is being executed and the cycle
number so that it can produce a list of all the programs that halt.
Thus, we will ultimately know whether or not any program of less
than $n$ bits will halt. This enables the computer to find any proof
of the theorem or a counterexample to the theorem if the theorem
can be stated in less than $n$ bits. Knowledge of $\Omega$ turns previously
unprovable theorems into provable theorems. Here $\Omega$ acts as an
oracle.

Although $\Omega$ seems magical in this respect, there are other numbers
that carry the same information. For example, if we take the list of
programs and construct a real number in which the $i$th bit indicates
whether program $i$ halts, this number can also be used to decide
any finitely refutable question in mathematics. This number is very
dilute (in information content) because one needs approximately $2^n$
bits of this indicator function to decide whether or not an $n$-bit
program halts. Given $2^n$ bits, one can tell immediately without any
computation whether or not any program of length less than $n$ halts.
However, $\Omega$ is the most compact representation of this information
since it is algorithmically random and incompressible.

What are some of the questions that we can resolve using $\Omega$?
Many of the interesting problems in number theory can be stated
as a search for a counterexample. For example, it is straightforward
to write a program that searches over the integers $x$, $y$, $z$, and $n$
and halts only if it finds a counterexample to Fermat’s last theorem,
which states that

$$x^n + y^n = z^n \tag{14.73}$$

has no solution in positive integers for $n \ge 3$. Another example is Goldbach’s
conjecture, which states that any even number greater than 2 is a sum of two
primes. Our program would search through all the even numbers
starting with 4, check all prime numbers less than it and find a
decomposition as a sum of two primes. It will halt if it comes across
an even number that does not have such a decomposition. Knowing
whether this program halts is equivalent to knowing the truth of
Goldbach’s conjecture.

We can also design a program that searches through all proofs 
and halts only when it finds a proof of the theorem required. This 
program will eventually halt if the theorem has a finite proof. Hence 
knowing $n$ bits of $\Omega$, we can find the truth or falsity of all theorems
that have a finite proof or are finitely refutable and which can be 
stated in less than n bits. 

3. $\Omega$ is algorithmically random.

Theorem 14.8.1 $\Omega$ cannot be compressed by more than a constant;
that is, there exists a constant $c$ such that

$$K(\omega_1 \omega_2 \ldots \omega_n) \ge n - c \quad \text{for all } n. \tag{14.74}$$

Proof: We know that if we are given $n$ bits of $\Omega$, we can determine
whether or not any program of length $\le n$ halts. Using $K(\omega_1 \omega_2 \ldots
\omega_n)$ bits, we can calculate $n$ bits of $\Omega$, and then we can generate a list
of all programs of length $\le n$ that halt, together with their corresponding
outputs. We find the first string $x_0$ that is not on this list. The string $x_0$
is then the shortest string with Kolmogorov complexity $K(x_0) > n$. The
complexity of this program to print $x_0$ is $K(\Omega_n) + c$, which must be at
least as long as the shortest program for $x_0$. Consequently,

$$K(\Omega_n) + c \ge K(x_0) > n \tag{14.75}$$

for all $n$. Thus, $K(\omega_1 \omega_2 \ldots \omega_n) > n - c$, and $\Omega$ cannot be compressed by
more than a constant. □
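
The "parallel" execution schedule used in property 2 can be sketched with toy programs. Below, each program is modeled as a Python generator function in which every `yield` counts as one clock cycle; this modeling, the helper names, and the choice to restart each program in every round (rather than resume it) are simplifying assumptions, not the text's construction for a real universal computer.

```python
from typing import Callable, Dict, Iterator, Sequence

Program = Callable[[], Iterator[None]]


def dovetail(programs: Sequence[Program], rounds: int) -> Dict[int, int]:
    """Run toy programs in a staggered schedule.

    In round r (r = 1, 2, ...), each of the first r programs is started
    afresh and given r clock cycles. Returns a dict mapping the index of
    each program seen to halt to the round in which that was discovered.
    """
    halted: Dict[int, int] = {}
    for r in range(1, rounds + 1):
        for idx in range(min(r, len(programs))):
            if idx in halted:
                continue
            run = programs[idx]()        # fresh execution of program idx
            for _ in range(r):
                try:
                    next(run)            # one clock cycle
                except StopIteration:
                    halted[idx] = r      # the program finished within r cycles
                    break
    return halted


def halts_after(steps: int) -> Program:
    """Toy program that runs for `steps` cycles and then halts."""
    def program() -> Iterator[None]:
        for _ in range(steps):
            yield
    return program


def loops_forever() -> Iterator[None]:
    """Toy program that never halts."""
    while True:
        yield
```

No finite number of rounds can certify that `loops_forever` never halts; the schedule only guarantees that every halting program is eventually caught, which is exactly where the extra information in $\Omega_n$ is needed.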

14.9 UNIVERSAL GAMBLING 

Suppose that a gambler is asked to gamble sequentially on sequences
$x \in \{0, 1\}^*$. He has no idea of the origin of the sequence. He is given
fair odds (2-for-1) on each bit. How should he gamble? If he knew the
distribution of the elements of the string, he might use proportional betting
because of its optimal growth-rate properties, as shown in Chapter 6. If he
believes that the string occurred naturally, it seems intuitive that simpler
strings are more likely than complex ones. Hence, if he were to extend
the idea of proportional betting, he might bet according to the universal
probability of the string. For reference, note that if the gambler knows the
string $x$ in advance, he can increase his wealth by a factor of $2^{l(x)}$ simply
by betting all his wealth each time on the next symbol of $x$. Let the wealth
$S(x)$ associated with betting scheme $b(x)$, $\sum_x b(x) = 1$, be given by

$$S(x) = 2^{l(x)} b(x). \tag{14.76}$$

Suppose that the gambler bets $b(x) = 2^{-K(x)}$ on a string $x$. This betting
strategy can be called universal gambling. We note that the sum of the
bets

$$\sum_x b(x) = \sum_x 2^{-K(x)} \le \sum_{p :\, p \text{ halts}} 2^{-l(p)} = \Omega \le 1, \tag{14.77}$$

and he will not have used all his money. For simplicity, let us assume
that he throws the rest away. For example, the amount of wealth resulting
from a bet $b(0110)$ on a sequence $x = 0110$ is $2^{l(x)} b(x) = 2^4 b(0110)$ plus
the amount won on all bets $b(0110\ldots)$ on sequences that extend $x$.

Then we have the following theorem: 

Theorem 14.9.1 The logarithm of the wealth a gambler achieves on a
sequence using universal gambling plus the complexity of the sequence is
no smaller than the length of the sequence, or

$$\log S(x) + K(x) \ge l(x). \tag{14.78}$$

Remark This is the counterpart of the gambling conservation theorem
$W^* + H = \log m$ from Chapter 6.

Proof: The proof follows directly from the universal gambling scheme,
$b(x) = 2^{-K(x)}$, since

$$S(x) = \sum_{x' \sqsupseteq x} 2^{l(x)} b(x') \ge 2^{l(x)} 2^{-K(x)}, \tag{14.79}$$

where $x' \sqsupseteq x$ means that $x$ is a prefix of $x'$. Taking logarithms establishes
the theorem. □

The result can be understood in many ways. For infinite sequences $x$
with finite Kolmogorov complexity,

$$S(x_1 x_2 \cdots x_l) \ge 2^{l - K(x)} = 2^{l - c} \tag{14.80}$$

for all $l$. Since $2^l$ is the most that can be won in $l$ gambles at fair odds,
this scheme does asymptotically as well as the scheme based on knowing
the sequence in advance. For example, if $x = \pi_1 \pi_2 \cdots \pi_n \cdots$, the digits
in the expansion of $\pi$, the wealth at time $n$ will be $S_n = S(x^n) \ge 2^{n - c}$
for all $n$.

If the string is actually generated by a Bernoulli process with parameter
$p$, then

$$S(X_1 \ldots X_n) \ge 2^{n - nH_0(\bar{X}_n) - 2 \log n - c} \approx 2^{n\left(1 - H_0(p) - 2\frac{\log n}{n} - \frac{c}{n}\right)}, \tag{14.81}$$

which is the same to first order as the rate achieved when the gambler
knows the distribution in advance, as in Chapter 6.

From the examples we see that the universal gambling scheme on a 
random sequence does asymptotically as well as a scheme that uses prior 
knowledge of the true distribution. 
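
Because $2^{-K(x)}$ cannot be computed, the relation $S(x) = 2^{l(x)} b(x)$ can only be demonstrated with a computable stand-in for $b$. The sketch below bets sequentially at fair odds using a simple predictor (Laplace's rule, chosen here purely for illustration) and tracks the resulting wealth exactly. The function names are assumptions.

```python
from fractions import Fraction
from typing import Callable


def sequential_wealth(x: str, cond_prob_one: Callable[[str], Fraction]) -> Fraction:
    """Wealth after betting on each bit of x at fair (2-for-1) odds.

    Before each bit, the gambler splits current wealth in proportion to
    cond_prob_one(prefix) on '1' and the remainder on '0'.
    """
    wealth = Fraction(1)
    for t, bit in enumerate(x):
        p1 = cond_prob_one(x[:t])
        wealth *= 2 * (p1 if bit == "1" else 1 - p1)
    return wealth


def laplace_rule(prefix: str) -> Fraction:
    """A computable stand-in predictor: (ones seen + 1) / (length + 2)."""
    return Fraction(prefix.count("1") + 1, len(prefix) + 2)


def uniform_rule(prefix: str) -> Fraction:
    """Predictor that treats every bit as a fair coin flip."""
    return Fraction(1, 2)


if __name__ == "__main__":
    print(sequential_wealth("1111", laplace_rule))
    print(sequential_wealth("0110", uniform_rule))
```

The product of the conditional probabilities is the probability $b(x)$ that the predictor assigns to the whole string, so the final wealth equals $2^{l(x)} b(x)$: strings the predictor finds likely pay off, and the fair-coin predictor never gains or loses.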

14.10 OCCAM'S RAZOR 

In many areas of scientific research, it is important to choose among 
various explanations of data observed. After choosing the explanation, 
we wish to assign a confidence level to the predictions that ensue from 
the laws that have been deduced. For example, Laplace considered the 
probability that the sun will rise again tomorrow given that it has risen 
every day in recorded history. Laplace’s solution was to assume that the
rising of the sun was a Bernoulli($\theta$) process with unknown parameter $\theta$.
He assumed that $\theta$ was uniformly distributed on the unit interval. Using
the observed data, he calculated the posterior probability that the sun will
rise again tomorrow and found that it was


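\begin{align}
P(X_{n+1} = 1 \mid X_n = 1, X_{n-1} = 1, \ldots, X_1 = 1)
&= \frac{P(X_{n+1} = 1, X_n = 1, X_{n-1} = 1, \ldots, X_1 = 1)}{P(X_n = 1, X_{n-1} = 1, \ldots, X_1 = 1)} \notag \\
&= \frac{\int_0^1 \theta^{n+1} \, d\theta}{\int_0^1 \theta^n \, d\theta} \tag{14.82} \\
&= \frac{n+1}{n+2}, \tag{14.83}
\end{align}

which he put forward as the probability that the sun will rise on day $n + 1$
given that it has risen on days 1 through $n$.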
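
Laplace's ratio of integrals is quick to verify with exact arithmetic, using $\int_0^1 \theta^m \, d\theta = 1/(m+1)$. The function names below are assumptions for illustration.

```python
from fractions import Fraction


def moment(m: int) -> Fraction:
    """Integral of theta**m over [0, 1], which equals 1 / (m + 1)."""
    return Fraction(1, m + 1)


def prob_next_one(n: int) -> Fraction:
    """P(X_{n+1} = 1 | n ones observed) under a uniform prior on theta."""
    return moment(n + 1) / moment(n)


if __name__ == "__main__":
    for n in (0, 1, 10, 1000):
        print(n, prob_next_one(n), 1 - prob_next_one(n))
```

The complementary probability, that the next symbol is a 0, is $1 - \frac{n+1}{n+2} = \frac{1}{n+2}$.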

Using the ideas of Kolmogorov complexity and universal probability,
we can provide an alternative approach to the problem. Under the universal
probability, let us calculate the probability of seeing a 1 next after having
observed n 1’s in the sequence so far. The conditional probability that
the next symbol is a 1 is the ratio of the probability of all sequences
with initial segment $1^n$ and next bit equal to 1 to the probability of all
sequences with initial segment $1^n$. The simplest programs carry most of
the probability; hence we can approximate the probability that the next
bit is a 1 with the probability of the program that says “Print 1’s forever.”
Thus,


\[ \sum_{y} p(1^n 1 y) \approx p(1^\infty) = c > 0. \tag{14.84} \]


Estimating the probability that the next bit is 0 is more difficult. Since any
program that prints $1^n 0 \ldots$ yields a description of n, its length should at
least be K(n), which for most n is about $\log n + O(\log \log n)$, and hence
ignoring second-order terms, we have


\[ \sum_{y} p(1^n 0 y) \approx p(1^n 0) \approx 2^{-\log n}. \tag{14.85} \]

Hence, the conditional probability of observing a 0 next is

\[ p(0 \mid 1^n) = \frac{p(1^n 0)}{p(1^n 0) + p(1^\infty)} \approx \frac{1}{cn + 1}, \tag{14.86} \]


which is similar to the result $p(0 \mid 1^n) = 1/(n + 2)$ that follows from Laplace’s rule (14.83).
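
To see how close the two estimates are, we can tabulate the approximation
(14.86) next to Laplace’s value 1 − (n + 1)/(n + 2) = 1/(n + 2) from (14.83).
The constant c depends on the universal computer and is not known; the
values of c used below are purely illustrative assumptions, as are the
function names.

```python
from fractions import Fraction


def universal_next_zero(n: int, c: Fraction) -> Fraction:
    """Approximation (14.86), taking p(1^n 0) ~ 2^(-log n) = 1/n."""
    p_zero = Fraction(1, n)
    return p_zero / (p_zero + c)


def laplace_next_zero(n: int) -> Fraction:
    """Complement of Laplace's rule (14.83)."""
    return 1 - Fraction(n + 1, n + 2)


for c in (Fraction(1, 2), Fraction(1)):  # hypothetical values of c
    for n in (10, 100, 1000):
        print(c, n, universal_next_zero(n, c), laplace_next_zero(n))
```

Both estimates decay like 1/n; they differ only in the constant multiplying
n, which is the sense in which the two results are similar.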

This type of argument is a special case of Occam’s Razor, a general
principle governing scientific research, weighting possible explanations by
their complexity. William of Occam said “Nunquam ponenda est pluralitas
sine necessitate”: Explanations should not be multiplied beyond necessity
[516]. In the end, we choose the simplest explanation that is consistent
with the data observed. For example, it is easier to accept the general
theory of relativity than it is to accept a correction factor of $c/r^3$ to the
gravitational law to explain the precession of the perihelion of Mercury,
since the general theory explains more with fewer assumptions than does
a “patched” Newtonian theory.

14.11 KOLMOGOROV COMPLEXITY AND UNIVERSAL 
PROBABILITY 

We now prove an equivalence between Kolmogorov complexity and
universal probability. We begin by repeating the basic definitions.


\begin{align}
K(x) &= \min_{p : U(p) = x} l(p), \tag{14.87} \\
P_U(x) &= \sum_{p : U(p) = x} 2^{-l(p)}. \tag{14.88}
\end{align}

Theorem 14.11.1 (Equivalence of K(x) and $\log \frac{1}{P_U(x)}$) There exists
a constant c, independent of x, such that

\[ 2^{-K(x)} \le P_U(x) \le c\, 2^{-K(x)} \tag{14.89} \]

for all strings x. Thus, the universal probability of a string x is determined
essentially by its Kolmogorov complexity.

Remark This implies that K(x) and $\log \frac{1}{P_U(x)}$ have equal status as
universal complexity measures, since

\[ K(x) - c' \le \log \frac{1}{P_U(x)} \le K(x). \tag{14.90} \]

Recall that the complexities defined with respect to two different computers,
$K_U$ and $K_{U'}$, are essentially equivalent complexity measures if
$|K_U(x) - K_{U'}(x)|$ is bounded. Theorem 14.11.1 shows that $K_U(x)$ and
$\log \frac{1}{P_U(x)}$ are essentially equivalent complexity measures.

Notice the striking similarity between the relationship of K(x) and
$\log \frac{1}{P_U(x)}$ in Kolmogorov complexity and the relationship of H(X) and
$\log \frac{1}{p(x)}$ in information theory. The ideal Shannon code length assignment
$l(x) = \log \frac{1}{p(x)}$ achieves an average description length H(X), while in
Kolmogorov complexity theory, the ideal description length $\log \frac{1}{P_U(x)}$ is
almost equal to K(x). Thus, $\log \frac{1}{p}$ is the natural notion of descriptive
complexity of x in algorithmic as well as probabilistic settings.

The upper bound in (14.90) is obvious from the definitions, but the
lower bound is more difficult to prove. The result is very surprising, since
there are an infinite number of programs that print x. From any program
it is possible to produce longer programs by padding the program with
irrelevant instructions. The theorem proves that although there are an
infinite number of such programs, the universal probability is essentially
determined by the largest term, which is $2^{-K(x)}$. If $P_U(x)$ is large, K(x)
is small, and vice versa.

However, there is another way to look at the harder bound, read as an
upper bound on K(x), that makes it less surprising. Consider any computable
probability mass function on strings p(x). Using this mass function, we can
construct a Shannon-Fano code (Section 5.9) for the source and then describe
each string by the corresponding codeword, which will have length
$\lceil \log \frac{1}{p(x)} \rceil$. Hence, for any computable distribution, we can
construct a description of a string using not more than $\log \frac{1}{p(x)} + c$
bits, which is an upper bound on the Kolmogorov complexity K(x). Even though
$P_U(x)$ is not a computable probability mass function, we are able to finesse
the problem using the rather involved tree construction procedure described
below.
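
For a computable mass function, the codeword lengths in this argument are
easy to produce. The sketch below computes $\lceil \log \frac{1}{p(x)} \rceil$
exactly and checks the Kraft inequality; the four-string mass function `pmf`
is a hypothetical example of ours, not one taken from the text.

```python
from fractions import Fraction


def shannon_length(p: Fraction) -> int:
    """Smallest integer L with 2**(-L) <= p, i.e. ceil(log2(1/p)), exactly."""
    length = 0
    while Fraction(1, 2 ** length) > p:
        length += 1
    return length


# Hypothetical computable pmf on four strings (an assumption for illustration).
pmf = {
    "00": Fraction(1, 2),
    "01": Fraction(1, 4),
    "10": Fraction(1, 6),
    "11": Fraction(1, 12),
}

lengths = {x: shannon_length(p) for x, p in pmf.items()}
kraft_sum = sum(Fraction(1, 2 ** L) for L in lengths.values())

print(lengths)
print(kraft_sum, kraft_sum <= 1)
```

Because each length satisfies $2^{-l(x)} \le p(x)$, the Kraft sum cannot exceed
the total probability, so a prefix code with these lengths exists.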

Proof: (of Theorem 14.11.1). The first inequality is simple. Let p* be
the shortest program for x. Then

\[ P_U(x) = \sum_{p : U(p) = x} 2^{-l(p)} \ge 2^{-l(p^*)} = 2^{-K(x)}, \tag{14.91} \]

as we wished to show.
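
The first inequality can be seen on a toy table of halting programs. A real
universal computer has infinitely many programs for each string, so the
finite prefix-free table below is only an illustrative assumption (its
entries and the function names are ours); on it, the sum defining the
universal probability always contains the term contributed by the shortest
program.

```python
from fractions import Fraction

# Hypothetical prefix-free table: halting program -> output string.
PROGRAMS = {
    "0": "1110",
    "11": "10",
    "100": "1",
    "1010": "1111",
    "10111": "1110",
    "101101": "1110",
}


def complexity(x: str) -> int:
    """Length of the shortest program in the table that prints x."""
    return min(len(p) for p, out in PROGRAMS.items() if out == x)


def universal_probability(x: str) -> Fraction:
    """Sum of 2**(-l(p)) over the programs in the table that print x."""
    return sum(Fraction(1, 2 ** len(p)) for p, out in PROGRAMS.items() if out == x)


for x in sorted(set(PROGRAMS.values())):
    k = complexity(x)
    prob = universal_probability(x)
    # The bound 2^{-K(x)} <= P_U(x) of (14.91), restricted to the table.
    print(x, k, prob, Fraction(1, 2 ** k) <= prob)
```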

We can rewrite the second inequality as

\[ K(x) \le \log \frac{1}{P_U(x)} + c. \tag{14.92} \]

Our objective in the proof is to find a short program to describe the strings
that have high $P_U(x)$. An obvious idea is some kind of Huffman coding
based on $P_U(x)$, but $P_U(x)$ cannot be calculated effectively; hence a
procedure using Huffman coding is not implementable on a computer. Similarly,
the process using the Shannon-Fano code also cannot be implemented.
However, if we have the Shannon-Fano code tree, we can reconstruct the
string by looking for the corresponding node in the tree. This is the basis
for the following tree construction procedure.

To overcome the problem of noncomputability of $P_U(x)$, we use a
modified approach, trying to construct a code tree directly. Unlike Huffman
coding, this approach is not optimal in terms of minimum expected
codeword length. However, it is good enough for us to derive a code for which
each codeword for x has a length that is within a constant of $\log \frac{1}{P_U(x)}$.

Before we get into the details of the proof, let us outline our approach. 
We want to construct a code tree in such a way that strings with high 
probability have low depth. Since we cannot calculate the probability of a 
string, we do not know a priori the depth of the string on the tree. Instead, 
we assign x successively to the nodes of the tree, assigning x to nodes 
closer and closer to the root as our estimate of Pu(x) improves. We want 
the computer to be able to recreate the tree and use the lowest depth node 
corresponding to the string x to reconstruct the string. 

We now consider the set of programs and their corresponding outputs 
{(p, x)}. We try to assign these pairs to the tree. But we immediately 
come across a problem — there are an infinite number of programs for a 
given string, and we do not have enough nodes of low depth. However, 
as we shall show, if we trim the list of program-output pairs, we will be 
able to define a more manageable list that can be assigned to the tree. 
Next, we demonstrate the existence of programs for x of length $\log \frac{1}{P_U(x)}$.

Tree construction procedure: For the universal computer U, we simulate
all programs using the technique explained in Section 14.8. We list all
binary programs:

\[ \Lambda, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, \ldots. \tag{14.93} \]

Then let the computer execute one clock cycle of $\Lambda$ (the empty program)
for the first stage. In the next stage, let the computer execute two clock
cycles of $\Lambda$ and two clock cycles of the program 0. In the third stage,
let the computer execute three clock cycles of each of the first three
programs, and so on. In this way, the computer will eventually run all
possible programs and run them for longer and longer times, so that if any
program halts, it will be discovered to halt eventually. We use this method
to produce a list of all programs that halt in the order in which they halt,
together with their associated outputs. For each program and its
corresponding output, $(p_k, x_k)$, we calculate $n_k$, which is chosen so that
it corresponds to the current estimate of $P_U(x)$. Specifically,

\[ n_k = \left\lceil \log \frac{1}{\hat{P}_U(x_k)} \right\rceil, \tag{14.94} \]

where

\[ \hat{P}_U(x_k) = \sum_{(p_i, x_i) \,:\, x_i = x_k,\ i \le k} 2^{-l(p_i)}. \tag{14.95} \]

Note that $\hat{P}_U(x_k) \uparrow P_U(x)$ on the subsequence of times k such that $x_k = x$.
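
The staged simulation can be sketched on a toy machine. The table
`TOY_MACHINE`, which lists for a few programs the number of clock cycles
needed to halt and the output, is a hypothetical stand-in for a universal
computer (programs missing from it never halt); the function names are also
ours. At stage s the first s programs of the list (14.93) are each run for s
cycles, and a program is recorded the first time it is seen to halt.

```python
from itertools import count, product


def all_programs():
    """Yield binary strings in the order of (14.93): '', '0', '1', '00', ..."""
    for length in count(0):
        for bits in product("01", repeat=length):
            yield "".join(bits)


# Hypothetical toy machine: program -> (cycles needed to halt, output).
TOY_MACHINE = {
    "0": (3, "1110"),
    "11": (2, "10"),
    "100": (9, "1"),
    "1010": (5, "1111"),
}


def dovetail(machine, stages):
    """Return (program, output) pairs in the order in which halting is discovered."""
    enumerated = []  # the first `stage` programs of the enumeration
    halted = []      # discovered halting programs, in discovery order
    done = set()
    source = all_programs()
    for stage in range(1, stages + 1):
        enumerated.append(next(source))
        for p in enumerated:
            if p in done or p not in machine:
                continue  # already recorded, or never halts on the toy machine
            cycles, output = machine[p]
            if cycles <= stage:  # p halts within `stage` clock cycles
                done.add(p)
                halted.append((p, output))
    return halted


print(dovetail(TOY_MACHINE, stages=30))
```

A program that needs t cycles and sits at position i in the enumeration is
discovered at stage max(i, t), so every halting program is eventually listed
even though non-halting programs are never abandoned.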

We are now ready to construct a tree. As we add to the list of triplets,
$(p_k, x_k, n_k)$, of programs that halt, we map some of them onto nodes of
a binary tree. For purposes of the construction, we must ensure that all
the $n_k$’s corresponding to a particular $x_k$ are distinct. To ensure this, we
remove from the list all triplets that have the same x and n as a previous
triplet. This will ensure that there is at most one node at each level of the
tree that corresponds to a given x.

Let $\{(p'_i, x'_i, n'_i) : i = 1, 2, 3, \ldots\}$ denote the new list. On the winnowed
list, we assign the triplet $(p'_k, x'_k, n'_k)$ to the first available node at level
$n'_k + 1$. As soon as a node is assigned, all of its descendants become
unavailable for assignment. (This keeps the assignment prefix-free.)

We illustrate this by means of an example: 


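\[
\begin{array}{ll}
(p_1, x_1, n_1) = (10111, 1110, 5), & n_1 = 5 \text{ because } \hat{P}_U(x_1) \ge 2^{-l(p_1)} = 2^{-5} \\
(p_2, x_2, n_2) = (11, 10, 2), & n_2 = 2 \text{ because } \hat{P}_U(x_2) \ge 2^{-l(p_2)} = 2^{-2} \\
(p_3, x_3, n_3) = (0, 1110, 1), & n_3 = 1 \text{ because } \hat{P}_U(x_3) \ge 2^{-l(p_1)} + 2^{-l(p_3)} = 2^{-5} + 2^{-1} \ge 2^{-1} \\
(p_4, x_4, n_4) = (1010, 1111, 4), & n_4 = 4 \text{ because } \hat{P}_U(x_4) \ge 2^{-l(p_4)} = 2^{-4} \\
(p_5, x_5, n_5) = (101101, 1110, 1), & n_5 = 1 \text{ because } \hat{P}_U(x_5) \ge 2^{-1} + 2^{-5} + 2^{-6} \ge 2^{-1} \\
(p_6, x_6, n_6) = (100, 1, 3), & n_6 = 3 \text{ because } \hat{P}_U(x_6) \ge 2^{-l(p_6)} = 2^{-3} \\
\quad\vdots &
\end{array}
\tag{14.96}
\]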

We note that the string x = (1110) appears in positions 1, 3, and 5 in
the list, but $n_3 = n_5$. The estimate of the probability $\hat{P}_U(1110)$ has not
jumped sufficiently for $(p_5, x_5, n_5)$ to survive the cut. Thus the winnowed
list becomes

\[
\begin{array}{l}
(p'_1, x'_1, n'_1) = (10111, 1110, 5), \\
(p'_2, x'_2, n'_2) = (11, 10, 2), \\
(p'_3, x'_3, n'_3) = (0, 1110, 1), \\
(p'_4, x'_4, n'_4) = (1010, 1111, 4), \\
(p'_5, x'_5, n'_5) = (100, 1, 3), \\
\quad\vdots
\end{array}
\tag{14.97}
\]

The assignment of the winnowed list to nodes of the tree is illustrated in 
Figure 14.3. 
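
The example can be replayed mechanically. The sketch below takes the six
halting programs of the example in their halting order, computes $n_k$ from
the running estimate, winnows repeated (x, n) pairs, and assigns each
surviving triplet to the first available node at level $n_k + 1$. Two
choices here are assumptions of ours rather than statements from the text:
nodes at a level are scanned in lexicographic order, and a node counts as
unavailable if it is an ancestor or a descendant of an assigned node (the
text states only the descendant rule; excluding ancestors is what keeps the
code prefix-free).

```python
from fractions import Fraction

# (program, output) pairs of the example, in the order in which they halt.
HALTED = [("10111", "1110"), ("11", "10"), ("0", "1110"),
          ("1010", "1111"), ("101101", "1110"), ("100", "1")]


def ceil_log_inverse(q: Fraction) -> int:
    """ceil(log2(1/q)) for 0 < q <= 1, computed exactly."""
    n = 0
    while Fraction(1, 2 ** n) > q:
        n += 1
    return n


def triplets(halted):
    """Attach n_k, computed from the running estimate of P_U(x_k)."""
    estimate = {}
    result = []
    for p, x in halted:
        estimate[x] = estimate.get(x, Fraction(0)) + Fraction(1, 2 ** len(p))
        result.append((p, x, ceil_log_inverse(estimate[x])))
    return result


def winnow(trips):
    """Drop every triplet whose (x, n) pair has already appeared."""
    seen, kept = set(), []
    for p, x, n in trips:
        if (x, n) not in seen:
            seen.add((x, n))
            kept.append((p, x, n))
    return kept


def first_available(level, used):
    """First node (in lexicographic order) at `level` that conflicts with no used node."""
    for i in range(2 ** level):
        node = format(i, "0{}b".format(level))
        if not any(node.startswith(u) or u.startswith(node) for u in used):
            return node
    raise ValueError("no free node at this level")


def assign(kept):
    """Map each winnowed triplet to a tree node at level n + 1."""
    used, nodes = [], {}
    for p, x, n in kept:
        node = first_available(n + 1, used)
        used.append(node)
        nodes[node] = (p, x, n)
    return nodes


tree = assign(winnow(triplets(HALTED)))
for node in sorted(tree, key=lambda s: (len(s), s)):
    print(node, tree[node])
```

Under these assumptions the string 1110 receives the node 01 and the string
1111 receives the node 00001, the same paths that the text uses below as
their descriptions.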

In the example, we are able to find nodes at level $n_k + 1$ to which
we can assign the triplets. Now we shall prove that there are always
enough nodes so that the assignment can be completed. We can perform
the assignment of triplets to nodes if and only if the Kraft inequality is
satisfied.

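We now drop the primes and deal only with the winnowed list illustrated
in (14.97). We start with the infinite sum in the Kraft inequality and split
it according to the output strings:

\[ \sum_{k=1}^{\infty} 2^{-(n_k+1)} = \sum_{x \in \{0,1\}^*} \ \sum_{k : x_k = x} 2^{-(n_k+1)}. \tag{14.98} \]

We then write the inner sum as

\begin{align}
\sum_{k : x_k = x} 2^{-(n_k+1)} &= 2^{-1} \sum_{k : x_k = x} 2^{-n_k} \tag{14.99} \\
 &\le 2^{-1} \left( 2^{\lfloor \log P_U(x) \rfloor} + 2^{\lfloor \log P_U(x) \rfloor - 1} + 2^{\lfloor \log P_U(x) \rfloor - 2} + \cdots \right) \tag{14.100} \\
 &= 2^{-1}\, 2^{\lfloor \log P_U(x) \rfloor} \left( 1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) \tag{14.101} \\
 &= 2^{-1}\, 2^{\lfloor \log P_U(x) \rfloor}\, 2 \tag{14.102} \\
 &\le P_U(x), \tag{14.103}
\end{align}

where (14.100) is true because there is at most one node at each level
that prints out a particular x. More precisely, the $n_k$’s on the winnowed
list for a particular output string x are all different integers, each at
least $\lceil \log \frac{1}{P_U(x)} \rceil = -\lfloor \log P_U(x) \rfloor$ because
$\hat{P}_U(x_k) \le P_U(x)$. Hence,

\[ \sum_{k} 2^{-(n_k+1)} = \sum_{x} \ \sum_{k : x_k = x} 2^{-(n_k+1)} \le \sum_{x} P_U(x) \le 1, \tag{14.104} \]

and we can construct a tree with the nodes labeled by the triplets.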
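
As a sanity check of this bound on the five winnowed triplets of the
example, the sketch below sums $2^{-(n_k+1)}$ per output string and compares
it with the estimate of $P_U(x)$ reached so far, which is itself a lower
bound on $P_U(x)$. The estimates are recomputed from the program lengths of
the example; the variable names are ours.

```python
from fractions import Fraction

# (output string, n) pairs of the winnowed list in the example.
WINNOWED = [("1110", 5), ("10", 2), ("1110", 1), ("1111", 4), ("1", 3)]

# Lengths of the programs seen so far for each output string in the example.
PROGRAM_LENGTHS = {"1110": [5, 1, 6], "10": [2], "1111": [4], "1": [3]}

estimate = {x: sum(Fraction(1, 2 ** length) for length in lengths)
            for x, lengths in PROGRAM_LENGTHS.items()}

per_string = {}
for x, n in WINNOWED:
    per_string[x] = per_string.get(x, Fraction(0)) + Fraction(1, 2 ** (n + 1))

for x in per_string:
    # Inner sum of (14.99) against the current lower bound on P_U(x).
    print(x, per_string[x], estimate[x], per_string[x] <= estimate[x])

kraft_total = sum(per_string.values())
print(kraft_total, kraft_total <= 1)
```

The total stays below 1, so nodes at the required levels can always be
found for these triplets.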

If we are given the tree constructed above, it is easy to identify a given
x by the path to the lowest depth node that prints x. Call this node $\tilde{p}$.
(By construction, $l(\tilde{p}) \le \log \frac{1}{P_U(x)} + 2$.) To use this tree in a program
to print x, we specify $\tilde{p}$ and ask the computer to execute the foregoing
simulation of all programs. Then the computer will construct the tree as
described above and wait for the particular node $\tilde{p}$ to be assigned. Since
the computer executes the same construction as the sender, eventually the
node $\tilde{p}$ will be assigned. At this point, the computer will halt and print
out the x assigned to that node.

This is an effective (finite, mechanical) procedure for the computer to
reconstruct x. However, there is no effective procedure to find the lowest
depth node corresponding to x. All that we have proved is that there is
an (infinite) tree with a node corresponding to x at level
$\lceil \log \frac{1}{P_U(x)} \rceil + 1$. But this accomplishes our purpose.

With reference to the example, the description of x = 1110 is the path
to the node $(p_3, x_3, n_3)$ (i.e., 01), and the description of x = 1111 is the
path 00001. If we wish to describe the string 1110, we ask the computer
to perform the (simulation) tree construction until node 01 is assigned.
Then we ask the computer to execute the program corresponding to node
01 (i.e., $p_3$). The output of this program is the desired string, x = 1110.

The length of the program to reconstruct x is essentially the length of
the description of the position of the lowest depth node $\tilde{p}$ corresponding
to x in the tree. The length of this program for x is $l(\tilde{p}) + c$, where

\[ l(\tilde{p}) \le \left\lceil \log \frac{1}{P_U(x)} \right\rceil + 1, \tag{14.105} \]

and hence the complexity of x satisfies

\[ K(x) \le \log \frac{1}{P_U(x)} + c. \tag{14.106} \]


14.12 KOLMOGOROV SUFFICIENT STATISTIC 

Suppose that we are given a sample sequence from a Bernoulli($\theta$) process.
What are the regularities or deviations from randomness in this sequence?
One way to address the question is to find the Kolmogorov complexity
$K(x^n|n)$, which we discover to be roughly $nH_0(\theta) + \log n + c$. Since,
for $\theta \ne \frac{1}{2}$, this is much less than n, we conclude that $x^n$ has structure
and is not randomly drawn Bernoulli($\frac{1}{2}$). But what is the structure? The
first attempt to find the structure is to investigate the shortest program p*
for $x^n$. But the shortest description of p* is about as long as p* itself;
otherwise, we could further compress the description of $x^n$, contradicting
the minimality of p*. So this attempt is fruitless.

A hint at a good approach comes from an examination of the way in
which p* describes $x^n$. The program “The sequence has k 1’s; of such
sequences, it is the ith” is optimal to first order for Bernoulli($\theta$) sequences.
We note that it is a two-stage description, and all of the structure of the
sequence is captured in the first stage. Moreover, $x^n$ is maximally complex
given the first stage of the description. The first stage, the description of
k, requires log(n + 1) bits and defines a set $S = \{x \in \{0,1\}^n : \sum x_i = k\}$.
The second stage requires $\log |S| = \log \binom{n}{k} \approx nH_0(\bar{x}_n) \approx nH_0(\theta)$ bits and
reveals nothing extraordinary about $x^n$.
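
The sizes of the two stages are easy to compute for a concrete case. The
sketch below evaluates log(n + 1) and $\log \binom{n}{k}$ and compares the
second stage with $nH_0(k/n)$; the values n = 1000 and k = 300 are
arbitrary choices of ours, and logarithms are to base 2.

```python
from math import comb, log2


def binary_entropy(q: float) -> float:
    """H_0(q) in bits, with the convention 0 log 0 = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)


def two_stage_bits(n: int, k: int):
    """Bits for stage 1 (the value of k) and stage 2 (index within the type class)."""
    first = log2(n + 1)         # k takes one of n + 1 values
    second = log2(comb(n, k))   # log |S| for S = {x : sum x_i = k}
    return first, second


n, k = 1000, 300  # illustrative values
first, second = two_stage_bits(n, k)
print(first, second, n * binary_entropy(k / n))
```

The second stage is slightly below $nH_0(k/n)$, by an amount of order log n,
which is why the two-stage description is optimal to first order.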

We mimic this process for general sequences by looking for a simple
set S that contains $x^n$. We then follow it with a brute-force description of
$x^n$ in S using log |S| bits. We begin with a definition of the smallest set
containing $x^n$ that is describable in no more than k bits.

Definition The Kolmogorov structure function $K_k(x^n|n)$ of a binary
string $x^n \in \{0,1\}^n$ is defined as

\[ K_k(x^n|n) = \min_{\substack{p \,:\, l(p) \le k \\ U(p,n) = S \\ x^n \in S \subseteq \{0,1\}^n}} \log |S|. \tag{14.107} \]

The set S is the smallest set that can be described with no more than
k bits and which includes $x^n$. By U(p, n) = S, we mean that running the
program p with data n on the universal computer U will print out the
indicator function of the set S.


Definition For a given small constant c, let $k^*$ be the least k such that

\[ K_k(x^n|n) + k \le K(x^n|n) + c. \tag{14.108} \]

Let $S^{**}$ be the corresponding set and let $p^{**}$ be the program that prints out
the indicator function of $S^{**}$. Then we shall say that $p^{**}$ is a Kolmogorov
minimal sufficient statistic for $x^n$.

Consider the programs $p^*$ describing sets $S^*$ such that

\[ K_k(x^n|n) + k = K(x^n|n). \tag{14.109} \]

All the programs $p^*$ are “sufficient statistics” in that the complexity of
$x^n$ given $S^*$ is maximal. But the minimal sufficient statistic is the shortest
“sufficient statistic.”

The equality in the definition above is up to a large constant depending
on the computer U. Then $k^*$ corresponds to the least k for which the
two-stage description of $x^n$ is as good as the best single-stage description of
$x^n$. The second stage of the description merely provides the index of $x^n$
within the set $S^{**}$; this takes $K_k(x^n|n)$ bits if $x^n$ is conditionally maximally
complex given the set $S^{**}$. Hence the set $S^{**}$ captures all the structure
within $x^n$. The remaining description of $x^n$ within $S^{**}$ is essentially the
description of the randomness within the string. Hence $S^{**}$ or $p^{**}$ is called
the Kolmogorov sufficient statistic for $x^n$.

This is parallel to the definition of a sufficient statistic in mathematical
statistics. A statistic T is said to be sufficient for a parameter $\theta$ if the
distribution of the sample given the sufficient statistic is independent of
the parameter; that is,

\[ \theta \to T(X) \to X \tag{14.110} \]

forms a Markov chain in that order. For the Kolmogorov sufficient statistic,
the program $p^{**}$ is sufficient for the “structure” of the string $x^n$; the
remainder of the description of $x^n$ is essentially independent of the
“structure” of $x^n$. In particular, $x^n$ is maximally complex given $S^{**}$.

A typical graph of the structure function is illustrated in Figure 14.4.
When k = 0, the only set that can be described is the entire set $\{0,1\}^n$,
so that the corresponding log set size is n. As we increase k, the size of
the set drops rapidly until

\[ k + K_k(x^n|n) \approx K(x^n|n). \tag{14.111} \]

After this, each additional bit of k reduces the set by half, and we
proceed along the line of slope −1 until $k = K(x^n|n)$. For $k \ge K(x^n|n)$, the
smallest set that can be described that includes $x^n$ is the singleton $\{x^n\}$,
and hence $K_k(x^n|n) = 0$.

We will now illustrate the concept with a few examples. 

1. Bernoulli($\theta$) sequence. Consider a sample of length n from a
Bernoulli sequence with an unknown parameter $\theta$. As discussed in
Example 14.2, we can describe this sequence with $nH(\frac{k}{n}) + \frac{1}{2} \log n$
bits using a two-stage description where we describe k in the first
stage (using log n bits) and then describe the sequence within all
sequences with k ones (using $\log \binom{n}{k}$ bits). However, we can use an
even shorter first-stage description. Instead of describing k exactly,
we divide the range of k into bins and describe k only to an
accuracy of $\sqrt{\frac{k}{n} \frac{n-k}{n}} \sqrt{n}$ using $\frac{1}{2} \log n$ bits. Then we describe the actual
sequence among all sequences whose type is in the same bin as k.
The log of the size of the set of all sequences with l ones,
$l \in k \pm \sqrt{\frac{k}{n} \frac{n-k}{n}} \sqrt{n}$, is $nH(\frac{k}{n}) + o(n)$ by Stirling’s formula, so the
total description length is still $nH(\frac{k}{n}) + \frac{1}{2} \log n + o(n)$, but the
description length of the Kolmogorov sufficient statistic is
$k^* \approx \frac{1}{2} \log n$.

2. Sample from a Markov chain. In the same vein as the preceding
example, consider a sample from a first-order binary Markov chain.
In this case again, $p^{**}$ will correspond to describing the Markov type
of the sequence (the number of occurrences of 00’s, 01’s, 10’s, and
11’s in the sequence); this conveys all the structure in the sequence.
The remainder of the description will be the index of the sequence
in the set of all sequences of this Markov type. Hence, in this case,
$k^* \approx 2(\frac{1}{2} \log n) = \log n$, corresponding to describing two elements
of the conditional joint type to appropriate accuracy. (The other
elements of the conditional joint type can be determined from these
two.)

3. Mona Lisa. Consider an image that consists of a gray circle on a
white background. The circle is not uniformly gray but Bernoulli
with parameter $\theta$. This is illustrated in Figure 14.6. In this case, the
best two-stage description is first to describe the size and position of
the circle and its average gray level and then to describe the index of
the circle among all the circles with the same gray level. Assuming
an n-pixel image (of size $\sqrt{n}$ by $\sqrt{n}$), there are about n + 1 possible
gray levels, and there are about $(\sqrt{n})^3$ distinguishable circles. Hence,
$k^* \approx \frac{5}{2} \log n$ in this case.

14.13 MINIMUM DESCRIPTION LENGTH PRINCIPLE 

A natural extension of Occam’s razor occurs when we need to describe
data drawn from an unknown distribution. Let $X_1, X_2, \ldots, X_n$ be drawn
i.i.d. according to probability mass function p(x). We assume that we
do not know p(x), but know that $p(x) \in \mathcal{P}$, a class of probability mass
functions. Given the data, we can estimate the probability mass function in
$\mathcal{P}$ that best fits the data. For simple classes $\mathcal{P}$ (e.g., if $\mathcal{P}$ has only finitely
many distributions), the problem is straightforward, and the maximum
likelihood procedure [i.e., find $p \in \mathcal{P}$ that maximizes $p(X_1, X_2, \ldots, X_n)$]
works well. However, if the class $\mathcal{P}$ is rich enough, there is a problem
of overfitting the data. For example, if $X_1, X_2, \ldots, X_n$ are continuous
random variables, and if $\mathcal{P}$ is the set of all probability distributions, the
maximum likelihood estimator given $X_1, X_2, \ldots, X_n$ is a distribution that
places a single mass point of weight $\frac{1}{n}$ at each observed value. Clearly, this
estimate is too closely tied to actual observed data and does not capture
any of the structure of the underlying distribution.

To get around this problem, various methods have been applied. In the
simplest case, the data are assumed to come from some parametric
distribution (e.g., the normal distribution), and the parameters of the distribution
are estimated from the data. To validate this method, the data should be
tested to check whether the distribution “looks” normal, and if the data
pass the test, we could use this description of the data. A more general
procedure is to take the maximum likelihood estimate and smooth it out
to obtain a smooth density. With enough data, and appropriate smoothness
conditions, it is possible to make good estimates of the original density.
This process is called kernel density estimation.

However, the theory of Kolmogorov complexity (or the Kolmogorov
sufficient statistic) suggests a different procedure: Find the $p \in \mathcal{P}$ that
minimizes

\[ L_p(X_1, X_2, \ldots, X_n) = K(p) + \log \frac{1}{p(X_1, X_2, \ldots, X_n)}. \tag{14.112} \]

This is the length of a two-stage description of the data, where we first
describe the distribution p and then, given the distribution, construct the
Shannon code and describe the data using $\log \frac{1}{p(X_1, X_2, \ldots, X_n)}$ bits. This
procedure is a special case of what is termed the minimum description length
(MDL) principle: Given data and a choice of models, choose the model
such that the description of the model plus the conditional description of
the data is as short as possible.
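
Since K(p) is not computable, any concrete illustration of (14.112) has to
replace it by a computable stand-in. The sketch below does so for Bernoulli
models whose parameter lies on a grid of b-bit precision; charging 2b + 1
bits for the model (b + 1 bits to state b in unary, then b bits for the grid
index) is an assumption of ours and not the text's definition of K(p). The
data counts are also illustrative.

```python
from math import log2


def data_bits(ones: int, zeros: int, theta: float) -> float:
    """Ideal Shannon code length log2(1/p(x^n)) under Bernoulli(theta)."""
    return -(ones * log2(theta) + zeros * log2(1 - theta))


def mdl_bernoulli(ones: int, zeros: int, max_precision: int = 12):
    """Minimize (model bits + data bits) over grid-valued Bernoulli parameters.

    Returns (total bits, precision b, theta) of the best two-stage description.
    """
    best = None
    for b in range(max_precision + 1):
        model_bits = 2 * b + 1               # computable proxy for K(p)
        for j in range(2 ** b):
            theta = (j + 0.5) / 2 ** b       # grid midpoints, never 0 or 1
            total = model_bits + data_bits(ones, zeros, theta)
            if best is None or total < best[0]:
                best = (total, b, theta)
    return best


total, b, theta = mdl_bernoulli(ones=300, zeros=700)
print(total, b, theta)
```

The minimizer balances the two stages: a finer grid shortens the data
description only slightly once the parameter is known to within about
$1/\sqrt{n}$, while each extra bit of precision lengthens the model
description, so the selected precision stays small.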


SUMMARY 

Definition. The Kolmogorov complexity K(x) of a string x is

\begin{align}
K(x) &= \min_{p : U(p) = x} l(p), \tag{14.113} \\
K(x|l(x)) &= \min_{p : U(p, l(x)) = x} l(p). \tag{14.114}
\end{align}

Universality of Kolmogorov complexity. There exists a universal
computer U such that for any other computer A,

\[ K_U(x) \le K_A(x) + c_A \tag{14.115} \]

for any string x, where the constant $c_A$ does not depend on x. If U and
A are universal, $|K_U(x) - K_A(x)| \le c$ for all x.

Upper bound on Kolmogorov complexity

\begin{align}
K(x|l(x)) &\le l(x) + c, \tag{14.116} \\
K(x) &\le K(x|l(x)) + 2 \log l(x) + c. \tag{14.117}
\end{align}

Kolmogorov complexity and entropy. If $X_1, X_2, \ldots$ are i.i.d.
integer-valued random variables with entropy H, there exists a constant c such
that for all n,

\[ H \le \frac{1}{n} E K(X^n|n) \le H + |\mathcal{X}| \frac{\log n}{n} + \frac{c}{n}. \tag{14.118} \]

Lower bound on Kolmogorov complexity. There are no more than
$2^k$ strings x with complexity K(x) < k. If $X_1, X_2, \ldots, X_n$ are drawn
according to a Bernoulli($\frac{1}{2}$) process,

\[ \Pr\left( K(X_1 X_2 \ldots X_n | n) < n - k \right) < 2^{-k}. \tag{14.119} \]

Definition A sequence x is said to be incompressible if
$K(x_1 x_2 \ldots x_n | n)/n \to 1$.

Strong law of large numbers for incompressible sequences

\[ \frac{K(x_1, x_2, \ldots, x_n)}{n} \to 1 \ \Longrightarrow \ \frac{1}{n} \sum_{i=1}^{n} x_i \to \frac{1}{2}. \tag{14.120} \]

Definition The universal probability of a string x is

\[ P_U(x) = \sum_{p : U(p) = x} 2^{-l(p)} = \Pr(U(p) = x). \tag{14.121} \]


Universality of $P_U(x)$. For every computer A,

\[ P_U(x) \ge c_A P_A(x) \tag{14.122} \]

for every string $x \in \{0,1\}^*$, where the constant $c_A$ depends only on U
and A.

Definition $\Omega = \sum_{p : U(p) \text{ halts}} 2^{-l(p)} = \Pr(U(p) \text{ halts})$ is the
probability that the computer halts when the input p to the computer is a
binary string drawn according to a Bernoulli($\frac{1}{2}$) process.

Properties of $\Omega$

1. $\Omega$ is not computable.

2. $\Omega$ is a “philosopher’s stone.”

3. $\Omega$ is algorithmically random (incompressible).

Equivalence of K(x) and $\log \frac{1}{P_U(x)}$. There exists a constant c
independent of x such that

\[ \left| \log \frac{1}{P_U(x)} - K(x) \right| \le c \tag{14.123} \]

for all strings x. Thus, the universal probability of a string x is
essentially determined by its Kolmogorov complexity.

Definition The Kolmogorov structure function $K_k(x^n|n)$ of a binary
string $x^n \in \{0,1\}^n$ is defined as

\[ K_k(x^n|n) = \min_{\substack{p \,:\, l(p) \le k \\ U(p,n) = S \\ x^n \in S}} \log |S|. \tag{14.124} \]

Definition Let $k^*$ be the least k such that

\[ K_{k^*}(x^n|n) + k^* = K(x^n|n). \tag{14.125} \]

Let $S^{**}$ be the corresponding set and let $p^{**}$ be the program that prints
out the indicator function of $S^{**}$. Then $p^{**}$ is the Kolmogorov minimal
sufficient statistic for $x^n$.


PROBLEMS 


14.1 Kolmogorov complexity of two sequences. Let $x, y \in \{0,1\}^*$.
Argue that $K(x, y) \le K(x) + K(y) + c$.

14.2 Complexity of the sum 

(a) Argue that $K(n) \le \log n + 2 \log \log n + c$.

(b) Argue that $K(n_1 + n_2) \le K(n_1) + K(n_2) + c$.

(c) Give an example in which $n_1$ and $n_2$ are complex but the sum
is relatively simple.

14.3 Images. Consider an $n \times n$ array x of 0’s and 1’s. Thus, x has
$n^2$ bits.
Find the Kolmogorov complexity K(x | n) (to first order) if:

(a) x is a horizontal line. 

(b) x is a square. 

(c) x is the union of two lines, each line being vertical or hori¬ 
zontal. 

14.4 Do computers reduce entropy? Feed a random program P into
a universal computer. What is the entropy of the corresponding
output? Specifically, let X = U(P), where P is a Bernoulli($\frac{1}{2}$)
sequence. Here the binary sequence X is either undefined or is
in $\{0,1\}^*$. Let H(X) be the Shannon entropy of X. Argue that
$H(X) = \infty$. Thus, although the computer turns nonsense into
sense, the output entropy is still infinite.

14.5 Monkeys on a computer. Suppose that a random program is 
typed into a computer. Give a rough estimate of the probability 
that the computer prints the following sequence: 

(a) $0^n$ followed by any arbitrary sequence.

(b) $\pi_1 \pi_2 \cdots \pi_n$ followed by any arbitrary sequence, where $\pi_i$ is
the ith bit in the expansion of $\pi$.

(c) $0^n 1$ followed by any arbitrary sequence.

(d) $\omega_1 \omega_2 \cdots \omega_n$ followed by any arbitrary sequence.

(e) A proof of the four-color theorem. 

14.6 Kolmogorov complexity and ternary programs. Suppose that the
input programs for a universal computer U are sequences in
$\{0,1,2\}^*$ (ternary inputs). Also, suppose that U prints ternary
outputs. Let $K(x|l(x)) = \min_{U(p, l(x)) = x} l(p)$. Show that:

(a) $K(x^n|n) \le n + c$.

(b) $|\{x^n \in \{0,1\}^* : K(x^n|n) < k\}| < 3^k$.

14.7 Law of large numbers. Using ternary inputs and outputs as in
Problem 14.6, outline an argument demonstrating that if a
sequence x is algorithmically random [i.e., if $K(x|l(x)) \ge l(x)$],
the proportion of 0’s, 1’s, and 2’s in x must each be near $\frac{1}{3}$. It
may be helpful to use Stirling’s approximation $n! \approx (n/e)^n$.

14.8 Image complexity. Consider two binary subsets A and B (of an
$n \times n$ grid): for example,



Find general upper and lower bounds, in terms of K(A|n) and
K(B|n), for:

(a) $K(A^c|n)$.

(b) $K(A \cup B|n)$.

(c) $K(A \cap B|n)$.

14.9 Random program. Suppose that a random program (symbols
i.i.d. uniform over the symbol set) is fed into the nearest available
computer. To our surprise the first n bits of the binary expansion
of $1/\sqrt{2}$ are printed out. Roughly what would you say the
probability is that the next output bit will agree with the corresponding
bit in the expansion of $1/\sqrt{2}$?

14.10 Face-vase illusion 



(a) What is an upper bound on the complexity of a pattern on an 
m x m grid that has mirror-image symmetry about a vertical 
axis through the center of the grid and consists of horizontal 
line segments? 

(b) What is the complexity K if the image differs in one cell 
from the pattern described above? 

14.11 Kolmogorov complexity. Assume that n is very large and known.
Let all rectangles be parallel to the frame.

(a) What is the (maximal) Kolmogorov complexity of the union
of two rectangles on an $n \times n$ grid?


(b) What if the rectangles intersect at a corner? 


(c) What if they have the same (unknown) shape? 

(d) What if they have the same (unknown) area? 

(e) What is the minimum Kolmogorov complexity of the union 
of two rectangles? That is, what is the simplest union? 

(f) What is the (maximal) Kolmogorov complexity over all images 
(not necessarily rectangles) on an n x n grid? 

14.12 Encrypted text. Suppose that English text $x^n$ is encrypted into
$y^n$ by a substitution cypher: a 1-to-1 reassignment of each of the
27 letters of the alphabet (A-Z, including the space character)
to itself. Suppose that the Kolmogorov complexity of the text
$x^n$ is $K(x^n) = \frac{n}{4}$. (This is about right for English text. We’re
now assuming a 27-symbol programming language, instead of a
binary symbol-set for the programming language. So, the length
of the shortest program, using a 27-ary programming language,
that prints out a particular string of English text of length n, is
approximately n/4.)

(a) What is the Kolmogorov complexity of the encryption map?

(b) Estimate the Kolmogorov complexity of the encrypted text $y^n$.

(c) How high must n be before you would expect to be able to
decode $y^n$?

14.13 Kolmogorov complexity. Consider the Kolmogorov complexity
K(n) over the integers n. If a specific integer $n_1$ has a low
Kolmogorov complexity $K(n_1)$, by how much can the Kolmogorov
complexity $K(n_1 + k)$ for the integer $n_1 + k$ vary from $K(n_1)$?

14.14 Complexity of large numbers. Let A(n) be the set of positive
integers x for which a terminating program p of length less than
or equal to n bits exists that outputs x. Let B(n) be the
complement of A(n) [i.e., B(n) is the set of integers x for which no
program of length less than or equal to n outputs x]. Let M(n)
be the maximum integer in A(n), and let S(n) be the minimum
integer in B(n). What is the Kolmogorov complexity K(M(n))
(approximately)? What is K(S(n)) (approximately)? Which is
larger (M(n) or S(n))? Give a reasonable lower bound on M(n)
and a reasonable upper bound on S(n).


HISTORICAL NOTES 


The original ideas of Kolmogorov complexity were put forth
independently and almost simultaneously by Kolmogorov [321, 322], Solomonoff
[504], and Chaitin [89]. These ideas were developed further by students of
Kolmogorov such as Martin-Löf [374], who defined the notion of
algorithmically random sequences and algorithmic tests for randomness, and by
Levin and Zvonkin [353], who explored the ideas of universal probability
and its relationship to complexity. A series of papers by Chaitin [90]-[92]
developed the relationship between algorithmic complexity and
mathematical proofs. C. P. Schnorr studied the universal notion of randomness and
related it to gambling in [466]-[468].

The concept of the Kolmogorov structure function was defined by
Kolmogorov at a talk at the Tallinn conference in 1973, but these results
were not published. V’yugin pursues this in [549], where he shows that
there are some very strange sequences $x^n$ that reveal their structure
arbitrarily slowly in the sense that $K_k(x^n|n) = n - k$, $k < K(x^n|n)$. Zurek
[606]-[608] addresses the fundamental questions of Maxwell’s demon
and the second law of thermodynamics by establishing the physical
consequences of Kolmogorov complexity.

Rissanen’s minimum description length (MDL) principle is very close
in spirit to the Kolmogorov sufficient statistic. Rissanen [445, 446] finds a
low-complexity model that yields a high likelihood of the data. Barron and
Cover [32] argue that the density minimizing $K(f) + \log \frac{1}{\prod f(X_i)}$ yields
consistent density estimation.

A nontechnical introduction to the various measures of complexity can
be found in a thought-provoking book by Pagels [412]. Additional
references to work in the area can be found in a paper by Cover et al. [114] on
Kolmogorov’s contributions to information theory and algorithmic
complexity. A comprehensive introduction to the field, including applications
of the theory to analysis of algorithms and automata, may be found in the
book by Li and Vitanyi [354]. Additional coverage may be found in the
books by Chaitin [86, 93].



CHAPTER 15 

NETWORK INFORMATION 
THEORY 


A system with many senders and receivers contains many new elements 
in the communication problem: interference, cooperation, and feedback. 
These are the issues that are the domain of network information theory. 
The general problem is easy to state. Given many senders and receivers 
and a channel transition matrix that describes the effects of the interference 
and the noise in the network, decide whether or not the sources can be 
transmitted over the channel. This problem involves distributed source 
coding (data compression) as well as distributed communication (finding 
the capacity region of the network). This general problem has not yet been 
solved, so we consider various special cases in this chapter. 

Examples of large communication networks include computer networks, 
satellite networks, and the phone system. Even within a single computer, 
there are various components that talk to each other. A complete theory 
of network information would have wide implications for the design of 
communication and computer networks. 

Suppose that m stations wish to communicate with a common satellite 
over a common channel, as shown in Figure 15.1. This is known as a 
multiple-access channel. How do the various senders cooperate with each 
other to send information to the receiver? What rates of communication are 
achievable simultaneously? What limitations does interference among the 
senders put on the total rate of communication? This is the best understood 
multiuser channel, and the above questions have satisfying answers. 

In contrast, we can reverse the network and consider one TV station 
sending information to m TV receivers, as in Figure 15.2. How does 
the sender encode information meant for different receivers in a common 
signal? What are the rates at which information can be sent to the different 
receivers? For this channel, the answers are known only in special cases. 

Elements of Information Theory, Second Edition, By Thomas M. Cover and Joy A. Thomas 
Copyright © 2006 John Wiley & Sons, Inc. 

There are other channels, such as the relay channel (where there is
one source and one destination, but one or more intermediate
sender-receiver pairs that act as relays to facilitate the communication
between the source and the destination), the interference channel (two
senders and two receivers with crosstalk), and the two-way channel (two
sender-receiver pairs sending information to each other). For all these
channels, we have only some of the answers to questions about achievable
communication rates and the appropriate coding strategies.

All these channels can be considered special cases of a general
communication network that consists of m nodes trying to communicate with
each other, as shown in Figure 15.3. At each instant of time, the $i$th node
sends a symbol $x_i$ that depends on the messages that it wants to send
and on past received symbols at the node. The simultaneous transmission
of the symbols $(x_1, x_2, \ldots, x_m)$ results in random received symbols
$(Y_1, Y_2, \ldots, Y_m)$ drawn according to the conditional probability
distribution $p(y^{(1)}, y^{(2)}, \ldots, y^{(m)} | x^{(1)}, x^{(2)}, \ldots, x^{(m)})$.
Here $p(\cdot|\cdot)$ expresses the effects of the noise and interference
present in the network. If $p(\cdot|\cdot)$ takes on only the values 0 and 1,
the network is deterministic.

FIGURE 15.3. Communication network.

Associated with some of the nodes in the network are stochastic data 
sources, which are to be communicated to some of the other nodes in the 
network. If the sources are independent, the messages sent by the nodes 
are also independent. However, for full generality, we must allow the 
sources to be dependent. How does one take advantage of the dependence 
to reduce the amount of information transmitted? Given the probability 
distribution of the sources and the channel transition function, can one 
transmit these sources over the channel and recover the sources at the 
destinations with the appropriate distortion? 

We consider various special cases of network communication. We consider
the problem of source coding when the channels are noiseless and
without interference. In such cases, the problem reduces to finding the set
of rates associated with each source such that the required sources can
be decoded at the destination with a low probability of error (or
appropriate distortion). The simplest case for distributed source coding is
the Slepian-Wolf source coding problem, where we have two sources that
must be encoded separately, but decoded together at a common node. We
consider extensions to this theory when only one of the two sources needs
to be recovered at the destination.

The theory of flow in networks has satisfying answers in such domains
as circuit theory and the flow of water in pipes. For example, for the
single-source single-sink network of pipes shown in Figure 15.4, the
maximum flow from A to B can be computed easily from the Ford-Fulkerson
theorem. Assume that the edges have capacities $C_i$ as shown. Clearly,
the maximum flow across any cut set cannot be greater than the sum
of the capacities of the cut edges. Thus, minimizing the maximum flow
across cut sets yields an upper bound on the capacity of the network. The
Ford-Fulkerson theorem [214] shows that this capacity can be achieved.

FIGURE 15.4. Network of water pipes.
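
The max-flow min-cut statement can be checked numerically on a small
example. The sketch below is only an illustration: the four-node network
`PIPES`, its intermediate nodes C and D, and its capacities are hypothetical
and are not the network of Figure 15.4. It computes the maximum flow with
the Edmonds-Karp variant of the Ford-Fulkerson method and compares it with
the minimum cut found by exhaustive search.

```python
from collections import deque
from itertools import combinations


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none is left."""
    # residual[u][v] is the remaining capacity on the directed edge u -> v.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def min_cut(capacity, source, sink):
    """Brute force: smallest total capacity leaving any set that contains the source."""
    nodes = set(capacity) | {v for nbrs in capacity.values() for v in nbrs}
    others = sorted(nodes - {source, sink})
    best = float("inf")
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            side = {source, *subset}
            cut = sum(c for u in side
                      for v, c in capacity.get(u, {}).items() if v not in side)
            best = min(best, cut)
    return best


# Hypothetical pipe network (capacities chosen only for illustration).
PIPES = {"A": {"C": 3, "D": 2}, "C": {"D": 1, "B": 2}, "D": {"B": 3}}

if __name__ == "__main__":
    print(max_flow(PIPES, "A", "B"), min_cut(PIPES, "A", "B"))
```

The exhaustive cut search is exponential in the number of nodes and is used
here only to make the equality of the two quantities visible on a toy network.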

The theory of information flow in networks does not have the same 
simple answers as the theory of flow of water in pipes. Although we 
prove an upper bound on the rate of information flow across any cut set, 
these bounds are not achievable in general. However, it is gratifying that 
some problems, such as the relay channel and the cascade channel, admit 
a simple max-flow min-cut interpretation. Another subtle problem in the 
search for a general theory is the absence of a source-channel separation 
theorem, which we touch on briefly in Section 15.10. A complete theory 
combining distributed source coding and network channel coding is still 
a distant goal. 

In the next section we consider Gaussian examples of some of the 
basic channels of network information theory. The physically motivated 
Gaussian channel lends itself to concrete and easily interpreted answers. 
Later we prove some of the basic results about joint typicality that we use 
to prove the theorems of multiuser information theory. We then consider 
various problems in detail: the multiple-access channel, the coding of
correlated sources (Slepian-Wolf data compression), the broadcast channel,
the relay channel, the coding of a random variable with side information, 
and the rate distortion problem with side information. We end with an 
introduction to the general theory of information flow in networks. There 
are a number of open problems in the area, and there does not yet exist a 
comprehensive theory of information networks. Even if such a theory is 
found, it may be too complex for easy implementation. But the theory will 
be able to tell communication designers how close they are to optimality 
and perhaps suggest some means of improving the communication rates. 

15.1 GAUSSIAN MULTIPLE-USER CHANNELS

Gaussian multiple-user channels illustrate some of the important features 
of network information theory. The intuition gained in Chapter 9 on the 
Gaussian channel should make this section a useful introduction. Here the 
key ideas for establishing the capacity regions of the Gaussian multiple- 
access, broadcast, relay, and two-way channels will be given without 
proof. The proofs of the coding theorems for the discrete memoryless 
counterparts to these theorems are given in later sections of the chapter. 

The basic discrete-time additive white Gaussian noise channel with
input power P and noise variance N is modeled by

$$Y_i = X_i + Z_i, \qquad i = 1, 2, \ldots, \tag{15.1}$$

where the $Z_i$ are i.i.d. Gaussian random variables with mean 0 and
variance N. The signal $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ has a power
constraint

$$\frac{1}{n} \sum_{i=1}^{n} X_i^2 \le P. \tag{15.2}$$

The Shannon capacity C is obtained by maximizing $I(X; Y)$ over all
random variables X such that $EX^2 \le P$ and is given (Chapter 9) by

$$C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \quad \text{bits per transmission.} \tag{15.3}$$

In this chapter we restrict our attention to discrete-time memoryless
channels; the results can be extended to continuous-time Gaussian channels.
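
As a quick numerical companion to (15.3), the following sketch evaluates
the capacity function for a given signal-to-noise ratio. The function name
`gaussian_capacity` is an assumption made for this illustration; logarithms
are taken to base 2 so that the result is in bits per transmission.

```python
import math


def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission, for x = P/N >= 0."""
    if snr < 0:
        raise ValueError("signal-to-noise ratio must be nonnegative")
    return 0.5 * math.log2(1.0 + snr)


if __name__ == "__main__":
    for snr in (1.0, 3.0, 15.0):
        print(snr, gaussian_capacity(snr))
```

Each doubling of $1 + P/N$ adds half a bit per transmission, which is the
logarithmic growth that the multiuser results of this section inherit.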

15.1.1 Single-User Gaussian Channel 

We first review the single-user Gaussian channel studied in Chapter 9.
Here $Y = X + Z$. Choose a rate $R < \frac{1}{2} \log(1 + \frac{P}{N})$. Fix a
good $(2^{nR}, n)$ codebook of power P. Choose an index w in the set
$\{1, 2, \ldots, 2^{nR}\}$. Send the wth codeword $\mathbf{X}(w)$ from the
codebook generated above. The receiver observes
$\mathbf{Y} = \mathbf{X}(w) + \mathbf{Z}$ and then finds the index $\hat{w}$ of
the codeword closest to $\mathbf{Y}$. If n is sufficiently large, the
probability of error $\Pr(w \ne \hat{w})$ will be arbitrarily small. As can be
seen from the definition of joint typicality, this minimum-distance decoding
scheme is essentially equivalent to finding the codeword in the codebook
that is jointly typical with the received vector $\mathbf{Y}$.

15.1.2 Gaussian Multiple-Access Channel with m Users

We consider m transmitters, each with a power P. Let

$$Y = \sum_{i=1}^{m} X_i + Z. \tag{15.4}$$

Let

$$C\left(\frac{P}{N}\right) = \frac{1}{2} \log\left(1 + \frac{P}{N}\right) \tag{15.5}$$

denote the capacity of a single-user Gaussian channel with signal-to-noise
ratio P/N. The achievable rate region for the Gaussian channel takes on
the simple form given in the following equations:

$$R_i < C\left(\frac{P}{N}\right) \tag{15.6}$$

$$R_i + R_j < C\left(\frac{2P}{N}\right) \tag{15.7}$$

$$R_i + R_j + R_k < C\left(\frac{3P}{N}\right) \tag{15.8}$$

$$\vdots \tag{15.9}$$

$$\sum_{i=1}^{m} R_i < C\left(\frac{mP}{N}\right). \tag{15.10}$$

Note that when all the rates are the same, the last inequality dominates
the others.

Here we need m codebooks, the $i$th codebook having $2^{nR_i}$ codewords
of power P. Transmission is simple. Each of the independent transmitters
chooses an arbitrary codeword from its own codebook. The users send
these vectors simultaneously. The receiver sees these codewords added
together with the Gaussian noise $\mathbf{Z}$.

Optimal decoding consists of looking for the m codewords, one from
each codebook, such that the vector sum is closest to $\mathbf{Y}$ in
Euclidean distance. If $(R_1, R_2, \ldots, R_m)$ is in the capacity region
given above, the probability of error goes to 0 as n tends to infinity.
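
The rate constraints for the m-user Gaussian multiple-access channel can be
tested mechanically. The sketch below is an illustration under stated
assumptions: the function names are hypothetical, all users have the same
power P, and logarithms are base 2. Because the powers are equal, the
tightest constraint on any k users is the one on the k largest rates, so it
suffices to sort the rates and check the running sums.

```python
import math


def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1.0 + snr)


def in_gaussian_mac_region(rates, power, noise):
    """True if every subset of users satisfies sum of rates < C(k P / N)."""
    partial_sum = 0.0
    for k, rate in enumerate(sorted(rates, reverse=True), start=1):
        partial_sum += rate
        if partial_sum >= gaussian_capacity(k * power / noise):
            return False
    return True


def max_equal_rate(m, power, noise):
    """Supremum of the common rate when all m users send at the same rate."""
    return gaussian_capacity(m * power / noise) / m


if __name__ == "__main__":
    print(in_gaussian_mac_region([0.3, 0.3, 0.3], power=1.0, noise=1.0))
    print(in_gaussian_mac_region([0.4, 0.4, 0.4], power=1.0, noise=1.0))
    for m in (1, 3, 15):
        # The sum rate C(mP/N) grows with m while the per-user rate shrinks.
        print(m, gaussian_capacity(m * 1.0), max_equal_rate(m, 1.0, 1.0))
```

With equal rates only the sum constraint matters, since $C(kP/N)/k$ decreases
in k for a concave function C with $C(0) = 0$.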




Remarks It is exciting to see in this problem that the sum of the rates
of the users $C(mP/N)$ goes to infinity with m. Thus, in a cocktail party
with m celebrants of power P in the presence of ambient noise N, the
intended listener receives an unbounded amount of information as the
number of people grows to infinity. A similar conclusion holds, of course,
for ground communications to a satellite. Apparently, the increasing
interference as the number of senders $m \to \infty$ does not limit the total
received information.

It is also interesting to note that the optimal transmission scheme here
does not involve time-division multiplexing. In fact, each of the
transmitters uses all of the bandwidth all of the time.


15.1.3 Gaussian Broadcast Channel 

Here we assume that we have a sender of power P and two distant
receivers, one with Gaussian noise power $N_1$ and the other with Gaussian
noise power $N_2$. Without loss of generality, assume that $N_1 < N_2$. Thus,
receiver $Y_1$ is less noisy than receiver $Y_2$. The model for the channel is
$Y_1 = X + Z_1$ and $Y_2 = X + Z_2$, where $Z_1$ and $Z_2$ are arbitrarily
correlated Gaussian random variables with variances $N_1$ and $N_2$,
respectively. The sender wishes to send independent messages at rates
$R_1$ and $R_2$ to receivers $Y_1$ and $Y_2$, respectively.

Fortunately, all scalar Gaussian broadcast channels belong to the class
of degraded broadcast channels discussed in Section 15.6.2. Specializing
that work, we find that the capacity region of the Gaussian broadcast
channel is

$$R_1 < C\left(\frac{\alpha P}{N_1}\right) \tag{15.11}$$

$$R_2 < C\left(\frac{(1 - \alpha) P}{\alpha P + N_2}\right), \tag{15.12}$$

where $\alpha$ may be arbitrarily chosen ($0 \le \alpha \le 1$) to trade off
rate $R_1$ for rate $R_2$ as the transmitter wishes.

To encode the messages, the transmitter generates two codebooks, one
with power $\alpha P$ at rate $R_1$, and another codebook with power
$(1 - \alpha) P$ at rate $R_2$, where $R_1$ and $R_2$ lie in the capacity
region above. Then to send an index $w_1 \in \{1, 2, \ldots, 2^{nR_1}\}$ and
$w_2 \in \{1, 2, \ldots, 2^{nR_2}\}$ to $Y_1$ and $Y_2$, respectively, the
transmitter takes the codeword $\mathbf{X}(w_1)$ from the first codebook and
codeword $\mathbf{X}(w_2)$ from the second codebook and computes the
sum. He sends the sum over the channel.

The receivers must now decode the messages. First consider the bad
receiver $Y_2$. He merely looks through the second codebook to find the
closest codeword to the received vector $\mathbf{Y}_2$. His effective
signal-to-noise ratio is $(1 - \alpha) P/(\alpha P + N_2)$, since $Y_1$’s
message acts as noise to $Y_2$. (This can be proved.)

The good receiver $Y_1$ first decodes $Y_2$’s codeword, which he can
accomplish because of his lower noise $N_1$. He subtracts this codeword
$\hat{\mathbf{X}}_2$ from $\mathbf{Y}_1$. He then looks for the codeword in the
first codebook closest to $\mathbf{Y}_1 - \hat{\mathbf{X}}_2$. The resulting
probability of error can be made as low as desired.

A nice dividend of optimal encoding for degraded broadcast channels is
that the better receiver $Y_1$ always knows the message intended for receiver
$Y_2$ in addition to the message intended for himself.
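
The trade-off controlled by the power split can be traced numerically. The
sketch below is illustrative only: the function names and the example values
P = 10, N1 = 1, N2 = 4 are assumptions, and logarithms are base 2. It
evaluates the boundary rate pair of the Gaussian broadcast region when a
fraction alpha of the power goes to the codebook for the good receiver and
the remaining fraction 1 - alpha goes to the codebook for the bad receiver.

```python
import math


def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1.0 + snr)


def broadcast_rate_pair(alpha, power, n1, n2):
    """Boundary point (R1, R2) of the Gaussian broadcast region for a given alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Good receiver: sees only its own codebook after subtracting the other one.
    r1 = gaussian_capacity(alpha * power / n1)
    # Bad receiver: the good receiver's codeword acts as additional noise.
    r2 = gaussian_capacity((1.0 - alpha) * power / (alpha * power + n2))
    return r1, r2


if __name__ == "__main__":
    for step in range(0, 11):
        alpha = step / 10
        print(alpha, broadcast_rate_pair(alpha, power=10.0, n1=1.0, n2=4.0))
```

Sweeping alpha from 0 to 1 moves along the boundary from the point where all
power serves the bad receiver to the point where all power serves the good
receiver.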


15.1.4 Gaussian Relay Channel 

For the relay channel, we have a sender X and an ultimate intended
receiver Y. Also present is the relay channel, intended solely to help the
receiver. The Gaussian relay channel (Figure 15.31 in Section 15.7) is
given by

$$Y_1 = X + Z_1, \tag{15.13}$$

$$Y = X + Z_1 + X_1 + Z_2, \tag{15.14}$$

where $Z_1$ and $Z_2$ are independent zero-mean Gaussian random variables
with variance $N_1$ and $N_2$, respectively. The encoding allowed by the relay
is the causal sequence

$$X_{1i} = f_i(Y_{11}, Y_{12}, \ldots, Y_{1,i-1}). \tag{15.15}$$

Sender X has power P and sender $X_1$ has power $P_1$. The capacity is

$$C = \max_{0 \le \alpha \le 1} \min\left\{ C\left(\frac{P + P_1 + 2\sqrt{\bar{\alpha} P P_1}}{N_1 + N_2}\right),\; C\left(\frac{\alpha P}{N_1}\right) \right\}, \tag{15.16}$$

where $\bar{\alpha} = 1 - \alpha$. Note that if

$$\frac{P_1}{N_2} \ge \frac{P}{N_1}, \tag{15.17}$$

it can be seen that $C = C(P/N_1)$, which is achieved by $\alpha = 1$. The
channel appears to be noise-free after the relay, and the capacity
$C(P/N_1)$ from X to the relay can be achieved. Thus, the rate
$C(P/(N_1 + N_2))$ without the relay is increased by the presence of the
relay to $C(P/N_1)$. For large $N_2$ and for $P_1/N_2 \ge P/N_1$, we see that
the increment in rate is from $C(P/(N_1 + N_2)) \approx 0$ to $C(P/N_1)$.
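
The max-min expression for the Gaussian relay capacity is easy to evaluate
by a grid search over the power split. The sketch below is an illustration
under assumptions: the function names, the grid resolution `steps`, and the
example powers and noise variances are hypothetical, and logarithms are
base 2. It also checks the special case noted in the text, in which the
capacity reduces to C(P/N1) when P1/N2 is at least P/N1.

```python
import math


def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1.0 + snr)


def relay_capacity(p, p1, n1, n2, steps=1000):
    """Grid-search approximation of the Gaussian relay capacity (max over alpha of a min)."""
    best = 0.0
    for k in range(steps + 1):
        alpha = k / steps
        # Rate supported at the final receiver with coherent cooperation.
        cooperative = gaussian_capacity(
            (p + p1 + 2.0 * math.sqrt((1.0 - alpha) * p * p1)) / (n1 + n2))
        # Rate at which the relay can decode the fresh information.
        to_relay = gaussian_capacity(alpha * p / n1)
        best = max(best, min(cooperative, to_relay))
    return best


if __name__ == "__main__":
    # Strong relay: P1/N2 = 2 >= P/N1 = 1, so the capacity is C(P/N1).
    print(relay_capacity(1.0, 10.0, 1.0, 5.0), gaussian_capacity(1.0))
    # Weak relay: the capacity lies strictly between the no-relay rate and C(P/N1).
    print(gaussian_capacity(1.0 / 6.0), relay_capacity(1.0, 1.0, 1.0, 5.0))
```

Since the first term decreases in alpha and the second increases, the maximum
is attained either at alpha = 1 or where the two terms cross; a grid search is
used here only for simplicity.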

Let $R_1 < C(\alpha P/N_1)$. Two codebooks are needed. The first codebook
has $2^{nR_1}$ words of power $\alpha P$. The second has $2^{nR_0}$ codewords
of power $\bar{\alpha} P$ (with $\bar{\alpha} = 1 - \alpha$). We shall use
codewords from these codebooks successively to create the opportunity for
cooperation by the relay. We start by sending a codeword from the first
codebook. The relay now knows the index of this codeword since
$R_1 < C(\alpha P/N_1)$, but the intended receiver has a list of possible
codewords of size $2^{n(R_1 - C(\alpha P/(N_1 + N_2)))}$. This list calculation
involves a result on list codes.

In the next block, the transmitter and the relay wish to cooperate to
resolve the receiver’s uncertainty about the codeword sent previously that
is on the receiver’s list. Unfortunately, they cannot be sure what this list
is because they do not know the received signal Y. Thus, they randomly
partition the first codebook into $2^{nR_0}$ cells with an equal number of
codewords in each cell. The relay, the receiver, and the transmitter agree on
this partition. The relay and the transmitter find the cell of the partition
in which the codeword from the first codebook lies and cooperatively
send the codeword from the second codebook with that index. That is, X
and $X_1$ send the same designated codeword. The relay, of course, must
scale this codeword so that it meets his power constraint $P_1$. They now
transmit their codewords simultaneously. An important point to note here
is that the cooperative information sent by the relay and the transmitter
is sent coherently. So the power of the sum as seen by the receiver Y is
$(\sqrt{\bar{\alpha} P} + \sqrt{P_1})^2$.
However, this does not exhaust what the transmitter does in the second 
block. He also chooses a fresh codeword from the first codebook, adds it 
“on paper” to the cooperative codeword from the second codebook, and 
sends the sum over the channel. 

The reception by the ultimate receiver Y in the second block involves
first finding the cooperative index from the second codebook by looking
for the closest codeword in the second codebook. He subtracts the
codeword from the received sequence and then calculates a list of indices of
size $2^{nR_0}$ corresponding to all codewords of the first codebook that
might have been sent in the second block.

Now it is time for the intended receiver to complete computing the
codeword from the first codebook sent in the first block. He takes his list
of possible codewords that might have been sent in the first block and
intersects it with the cell of the partition that he has learned from the
cooperative relay transmission in the second block. The rates and powers
have been chosen so that it is highly probable that there is only one
codeword in the intersection. This is Y's guess about the information sent
in the first block.

We are now in steady state. In each new block, the transmitter and the
relay cooperate to resolve the list uncertainty from the previous block. In
addition, the transmitter superimposes some fresh information from his
first codebook to this transmission from the second codebook and
transmits the sum. The receiver is always one block behind, but for
sufficiently many blocks, this does not affect his overall rate of reception.

15.1.5 Gaussian Interference Channel 

The interference channel has two senders and two receivers. Sender 1 
wishes to send information to receiver 1. He does not care what receiver 
2 receives or understands; similarly with sender 2 and receiver 2. Each 
channel interferes with the other. This channel is illustrated in Figure 15.5. 
It is not quite a broadcast channel since there is only one intended receiver 
for each sender, nor is it a multiple access channel because each receiver 
is only interested in what is being sent by the corresponding transmitter. 
For symmetric interference, we have

$$Y_1 = X_1 + aX_2 + Z_1 \tag{15.18}$$

$$Y_2 = X_2 + aX_1 + Z_2, \tag{15.19}$$

where $Z_1$, $Z_2$ are independent $\mathcal{N}(0, N)$ random variables. This
channel has not been solved in general even in the Gaussian case. But
remarkably, in the case of high interference, it can be shown that the
capacity region of this channel is the same as if there were no interference
whatsoever.

To achieve this, generate two codebooks, each with power P and rate
$C(P/N)$. Each sender independently chooses a word from his book and
sends it. Now, if the interference a satisfies
$C(a^2 P/(P + N)) > C(P/N)$, the first receiver understands perfectly the
index of the second transmitter. He finds it by the usual technique of
looking for the closest codeword to his received signal. Once he finds this
signal, he subtracts it from his received waveform. Now there is a clean
channel between him and his sender. He then searches the sender’s codebook
to find the closest codeword and declares that codeword to be the one sent.

FIGURE 15.5. Gaussian interference channel.
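
The high-interference condition can be checked directly from its definition.
The sketch below is an illustration with hypothetical function names and
example values; logarithms are base 2. Since C is increasing, the condition
$C(a^2 P/(P + N)) > C(P/N)$ is equivalent to $a^2 > 1 + P/N$, and the code
verifies both forms.

```python
import math


def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1.0 + snr)


def interference_is_decodable(a, power, noise):
    """True if the interfering codeword can be decoded while treating the
    intended signal as noise, i.e. C(a^2 P / (P + N)) > C(P / N)."""
    return gaussian_capacity(a * a * power / (power + noise)) > gaussian_capacity(power / noise)


def interference_threshold(power, noise):
    """Smallest |a| (exclusive) above which the condition holds: sqrt(1 + P/N)."""
    return math.sqrt(1.0 + power / noise)


if __name__ == "__main__":
    print(interference_threshold(1.0, 1.0))
    for a in (1.0, 1.4, 1.5, 3.0):
        print(a, interference_is_decodable(a, power=1.0, noise=1.0))
```

Once the interfering codeword is decoded and subtracted, each receiver faces
the interference-free channel of rate C(P/N).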

15.1.6 Gaussian Two-Way Channel 

The two-way channel is very similar to the interference channel, with the 
additional provision that sender 1 is attached to receiver 2 and sender 2 
is attached to receiver 1, as shown in Figure 15.6. Hence, sender 1 can 
use information from previous received symbols of receiver 2 to decide 
what to send next. This channel introduces another fundamental aspect 
of network information theory: namely, feedback. Feedback enables the 
senders to use the partial information that each has about the other’s 
message to cooperate with each other. 

The capacity region of the two-way channel is not known in general.
This channel was first considered by Shannon [486], who derived upper
and lower bounds on the region (see Problem 15.15). For Gaussian
channels, these two bounds coincide and the capacity region is known; in
fact, the Gaussian two-way channel decomposes into two independent
channels.

Let $P_1$ and $P_2$ be the powers of transmitters 1 and 2, respectively,
and let $N_1$ and $N_2$ be the noise variances of the two channels. Then
the rates $R_1 < C(P_1/N_1)$ and $R_2 < C(P_2/N_2)$ can be achieved by the
techniques described for the interference channel. In this case, we generate
two codebooks of rates $R_1$ and $R_2$. Sender 1 sends a codeword from the
first codebook. Receiver 2 receives the sum of the codewords sent by the
two senders plus some noise. He simply subtracts out the codeword of
sender 2 and he has a clean channel from sender 1 (with only the noise of
variance $N_1$). Hence, the two-way Gaussian channel decomposes into two
independent Gaussian channels. But this is not the case for the general
two-way channel; in general, there is a trade-off between the two senders
so that both of them cannot send at the optimal rate at the same time.

FIGURE 15.6. Two-way channel.


15.2 JOINTLY TYPICAL SEQUENCES 


We have previewed the capacity results for networks by considering
multiuser Gaussian channels. We begin a more detailed analysis in this
section, where we extend the joint AEP proved in Chapter 7 to a form that we
will use to prove the theorems of network information theory. The joint
AEP will enable us to calculate the probability of error for jointly typical
decoding for the various coding schemes considered in this chapter.

Let $(X_1, X_2, \ldots, X_k)$ denote a finite collection of discrete random
variables with some fixed joint distribution,
$p(x^{(1)}, x^{(2)}, \ldots, x^{(k)})$,
$(x^{(1)}, x^{(2)}, \ldots, x^{(k)}) \in \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_k$.
Let S denote an ordered subset of these random variables and consider n
independent copies of S. Thus,

$$\Pr\{\mathbf{S} = \mathbf{s}\} = \prod_{i=1}^{n} \Pr\{S_i = s_i\}, \qquad \mathbf{s} \in \mathcal{S}^n. \tag{15.20}$$

For example, if $S = (X_j, X_l)$, then

$$\Pr\{\mathbf{S} = \mathbf{s}\} = \Pr\{(\mathbf{X}_j, \mathbf{X}_l) = (\mathbf{x}_j, \mathbf{x}_l)\} \tag{15.21}$$

$$= \prod_{i=1}^{n} p(x_{ij}, x_{il}). \tag{15.22}$$

To be explicit, we will sometimes use X(S) for S. By the law of large
numbers, for any subset S of random variables,

$$-\frac{1}{n} \log p(S_1, S_2, \ldots, S_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(S_i) \to H(S), \tag{15.23}$$

where the convergence takes place with probability 1 for all $2^k$ subsets,
$S \subseteq \{X^{(1)}, X^{(2)}, \ldots, X^{(k)}\}$.
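
Definition The set $A_\epsilon^{(n)}$ of $\epsilon$-typical n-sequences
$(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k)$ is defined by

$$A_\epsilon^{(n)}(X^{(1)}, X^{(2)}, \ldots, X^{(k)}) = A_\epsilon^{(n)} = \left\{ (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k) : \left| -\frac{1}{n} \log p(\mathbf{s}) - H(S) \right| < \epsilon, \ \forall S \subseteq \{X^{(1)}, X^{(2)}, \ldots, X^{(k)}\} \right\}. \tag{15.24}$$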


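Let $A_\epsilon^{(n)}(S)$ denote the restriction of $A_\epsilon^{(n)}$ to the
coordinates of S. Thus, if $S = (X_1, X_2)$, we have

$$A_\epsilon^{(n)}(X_1, X_2) = \Big\{ (\mathbf{x}_1, \mathbf{x}_2) : \left| -\frac{1}{n} \log p(\mathbf{x}_1, \mathbf{x}_2) - H(X_1, X_2) \right| < \epsilon, \ \left| -\frac{1}{n} \log p(\mathbf{x}_1) - H(X_1) \right| < \epsilon, \ \left| -\frac{1}{n} \log p(\mathbf{x}_2) - H(X_2) \right| < \epsilon \Big\}. \tag{15.25}$$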


Definition We will use the notation $a_n \doteq 2^{n(b \pm \epsilon)}$ to mean
that

$$\left| \frac{1}{n} \log a_n - b \right| < \epsilon \tag{15.26}$$

for n sufficiently large.


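Theorem 15.2.1 For any $\epsilon > 0$, for sufficiently large n,

1. $P(A_\epsilon^{(n)}(S)) \ge 1 - \epsilon, \ \forall S \subseteq \{X^{(1)}, X^{(2)}, \ldots, X^{(k)}\}$. (15.27)

2. $\mathbf{s} \in A_\epsilon^{(n)}(S) \Longrightarrow p(\mathbf{s}) \doteq 2^{-n(H(S) \pm \epsilon)}$. (15.28)

3. $|A_\epsilon^{(n)}(S)| \doteq 2^{n(H(S) \pm 2\epsilon)}$. (15.29)

4. Let $S_1, S_2 \subseteq \{X^{(1)}, X^{(2)}, \ldots, X^{(k)}\}$. If
$(\mathbf{s}_1, \mathbf{s}_2) \in A_\epsilon^{(n)}(S_1, S_2)$, then
$p(\mathbf{s}_1 | \mathbf{s}_2) \doteq 2^{-n(H(S_1 | S_2) \pm 2\epsilon)}$. (15.30)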

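Proof

1. This follows from the law of large numbers for the random variables
in the definition of $A_\epsilon^{(n)}(S)$.

2. This follows directly from the definition of $A_\epsilon^{(n)}(S)$.

3. This follows from

$$1 \ge \sum_{\mathbf{s} \in A_\epsilon^{(n)}(S)} p(\mathbf{s}) \tag{15.31}$$

$$\ge \sum_{\mathbf{s} \in A_\epsilon^{(n)}(S)} 2^{-n(H(S) + \epsilon)} \tag{15.32}$$

$$= |A_\epsilon^{(n)}(S)| \, 2^{-n(H(S) + \epsilon)}. \tag{15.33}$$

If n is sufficiently large, we can argue that

$$1 - \epsilon \le \sum_{\mathbf{s} \in A_\epsilon^{(n)}(S)} p(\mathbf{s}) \tag{15.34}$$

$$\le \sum_{\mathbf{s} \in A_\epsilon^{(n)}(S)} 2^{-n(H(S) - \epsilon)} \tag{15.35}$$

$$= |A_\epsilon^{(n)}(S)| \, 2^{-n(H(S) - \epsilon)}. \tag{15.36}$$

Combining (15.33) and (15.36), we have
$|A_\epsilon^{(n)}(S)| \doteq 2^{n(H(S) \pm 2\epsilon)}$ for sufficiently large n.

4. For $(\mathbf{s}_1, \mathbf{s}_2) \in A_\epsilon^{(n)}(S_1, S_2)$, we have
$p(\mathbf{s}_1) \doteq 2^{-n(H(S_1) \pm \epsilon)}$ and
$p(\mathbf{s}_1, \mathbf{s}_2) \doteq 2^{-n(H(S_1, S_2) \pm \epsilon)}$. Hence,

$$p(\mathbf{s}_2 | \mathbf{s}_1) = \frac{p(\mathbf{s}_1, \mathbf{s}_2)}{p(\mathbf{s}_1)} \doteq 2^{-n(H(S_2 | S_1) \pm 2\epsilon)}. \qquad \Box \tag{15.37}$$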
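
The counting argument in part 3 can be observed on a small example. The
sketch below is an illustration under assumptions: the function names, the
binary pair distribution `JOINT`, and the values of n and epsilon are all
hypothetical. It enumerates pair sequences by their type (the counts of the
four symbol pairs), so that the size and the probability of the jointly
typical set for two binary variables can be computed exactly without listing
all $4^n$ sequences.

```python
import math


def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def typical_set_stats(joint, n, eps):
    """Size and probability of the jointly typical set for a binary pair (X1, X2).

    `joint` maps (x1, x2) in {0,1}^2 to a strictly positive probability.
    Returns (size, probability, H(X1, X2)).
    """
    p1 = [joint[(0, 0)] + joint[(0, 1)], joint[(1, 0)] + joint[(1, 1)]]
    p2 = [joint[(0, 0)] + joint[(1, 0)], joint[(0, 1)] + joint[(1, 1)]]
    h12, h1, h2 = entropy(joint.values()), entropy(p1), entropy(p2)
    fact = math.factorial
    size, prob = 0, 0.0
    for n00 in range(n + 1):
        for n01 in range(n + 1 - n00):
            for n10 in range(n + 1 - n00 - n01):
                n11 = n - n00 - n01 - n10
                counts = {(0, 0): n00, (0, 1): n01, (1, 0): n10, (1, 1): n11}
                # Normalized negative log-probabilities of any sequence of this type.
                e12 = -sum(c * math.log2(joint[k]) for k, c in counts.items()) / n
                z1, z2 = n00 + n01, n00 + n10  # number of zeros in x1 and in x2
                e1 = -(z1 * math.log2(p1[0]) + (n - z1) * math.log2(p1[1])) / n
                e2 = -(z2 * math.log2(p2[0]) + (n - z2) * math.log2(p2[1])) / n
                if max(abs(e12 - h12), abs(e1 - h1), abs(e2 - h2)) < eps:
                    ways = fact(n) // (fact(n00) * fact(n01) * fact(n10) * fact(n11))
                    size += ways
                    prob += ways * 2.0 ** (-n * e12)
    return size, prob, h12


# Hypothetical joint distribution used only for illustration.
JOINT = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}

if __name__ == "__main__":
    for n in (20, 40, 80):
        size, prob, h12 = typical_set_stats(JOINT, n, eps=0.1)
        print(n, prob, math.log2(size) / n, h12)
```

The upper bound $|A_\epsilon^{(n)}(S)| \le 2^{n(H(S) + \epsilon)}$ holds for every
n, whereas the probability of the typical set approaches 1 only as n grows.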

The next theorem bounds the number of conditionally typical sequences 
for a given typical sequence. 
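
Theorem 15.2.2 Let $S_1$, $S_2$ be two subsets of
$X^{(1)}, X^{(2)}, \ldots, X^{(k)}$. For any $\epsilon > 0$, define
$A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)$ to be the set of $\mathbf{s}_1$ sequences
that are jointly $\epsilon$-typical with a particular $\mathbf{s}_2$ sequence.
If $\mathbf{s}_2 \in A_\epsilon^{(n)}(S_2)$, then for sufficiently large n,
we have

$$|A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)| \le 2^{n(H(S_1 | S_2) + 2\epsilon)} \tag{15.38}$$

and

$$(1 - \epsilon) \, 2^{n(H(S_1 | S_2) - 2\epsilon)} \le \sum_{\mathbf{s}_2} p(\mathbf{s}_2) \, |A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)|. \tag{15.39}$$

Proof: As in part 3 of Theorem 15.2.1, we have

$$1 \ge \sum_{\mathbf{s}_1 \in A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)} p(\mathbf{s}_1 | \mathbf{s}_2) \tag{15.40}$$

$$\ge \sum_{\mathbf{s}_1 \in A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)} 2^{-n(H(S_1 | S_2) + 2\epsilon)} \tag{15.41}$$

$$= |A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)| \, 2^{-n(H(S_1 | S_2) + 2\epsilon)}. \tag{15.42}$$

If n is sufficiently large, we can argue from (15.27) that

$$1 - \epsilon \le \sum_{\mathbf{s}_2} p(\mathbf{s}_2) \sum_{\mathbf{s}_1 \in A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)} p(\mathbf{s}_1 | \mathbf{s}_2) \tag{15.43}$$

$$\le \sum_{\mathbf{s}_2} p(\mathbf{s}_2) \sum_{\mathbf{s}_1 \in A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)} 2^{-n(H(S_1 | S_2) - 2\epsilon)} \tag{15.44}$$

$$= \sum_{\mathbf{s}_2} p(\mathbf{s}_2) \, |A_\epsilon^{(n)}(S_1 | \mathbf{s}_2)| \, 2^{-n(H(S_1 | S_2) - 2\epsilon)}. \qquad \Box \tag{15.45}$$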

To calculate the probability of decoding error, we need to know the
probability that conditionally independent sequences are jointly typical.
Let $S_1$, $S_2$, and $S_3$ be three subsets of
$\{X^{(1)}, X^{(2)}, \ldots, X^{(k)}\}$. If $S_1'$ and $S_2'$ are conditionally
independent given $S_3'$ but otherwise share the same pairwise marginals of
$(S_1, S_2, S_3)$, we have the following probability of joint typicality.

Theorem 15.2.3 Let $A_\epsilon^{(n)}$ denote the typical set for the
probability mass function $p(s_1, s_2, s_3)$, and let

$$P(\mathbf{S}_1' = \mathbf{s}_1, \mathbf{S}_2' = \mathbf{s}_2, \mathbf{S}_3' = \mathbf{s}_3) = \prod_{i=1}^{n} p(s_{1i} | s_{3i}) \, p(s_{2i} | s_{3i}) \, p(s_{3i}). \tag{15.46}$$

Then

$$P\{(\mathbf{S}_1', \mathbf{S}_2', \mathbf{S}_3') \in A_\epsilon^{(n)}\} \doteq 2^{-n(I(S_1; S_2 | S_3) \pm 6\epsilon)}. \tag{15.47}$$

Proof: We use the $\doteq$ notation from (15.26) to avoid calculating the
upper and lower bounds separately. We have

$$P\{(\mathbf{S}_1', \mathbf{S}_2', \mathbf{S}_3') \in A_\epsilon^{(n)}\} = \sum_{(\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3) \in A_\epsilon^{(n)}} p(\mathbf{s}_3) \, p(\mathbf{s}_1 | \mathbf{s}_3) \, p(\mathbf{s}_2 | \mathbf{s}_3) \tag{15.48}$$

$$\doteq |A_\epsilon^{(n)}(S_1, S_2, S_3)| \, 2^{-n(H(S_3) \pm \epsilon)} \, 2^{-n(H(S_1 | S_3) \pm 2\epsilon)} \, 2^{-n(H(S_2 | S_3) \pm 2\epsilon)} \tag{15.49}$$

$$\doteq 2^{n(H(S_1, S_2, S_3) \pm \epsilon)} \, 2^{-n(H(S_3) \pm \epsilon)} \, 2^{-n(H(S_1 | S_3) \pm 2\epsilon)} \, 2^{-n(H(S_2 | S_3) \pm 2\epsilon)} \tag{15.50}$$

$$\doteq 2^{-n(I(S_1; S_2 | S_3) \pm 6\epsilon)}. \qquad \Box \tag{15.51}$$

We will specialize this theorem to particular choices of $S_1$, $S_2$, and
$S_3$ for the various achievability proofs in this chapter.

15.3 MULTIPLE-ACCESS CHANNEL

The first channel that we examine in detail is the multiple-access channel,
in which two (or more) senders send information to a common receiver.
The channel is illustrated in Figure 15.7. A common example of this
channel is a satellite receiver with many independent ground stations, or
a set of cell phones communicating with a base station. We see that the
senders must contend not only with the receiver noise but with interference
from each other as well.

Definition A discrete memoryless multiple-access channel consists of
three alphabets, $\mathcal{X}_1$, $\mathcal{X}_2$, and $\mathcal{Y}$, and a
probability transition matrix $p(y | x_1, x_2)$.

FIGURE 15.7. Multiple-access channel.

Definition A $((2^{nR_1}, 2^{nR_2}), n)$ code for the multiple-access channel
consists of two sets of integers $\mathcal{W}_1 = \{1, 2, \ldots, 2^{nR_1}\}$ and
$\mathcal{W}_2 = \{1, 2, \ldots, 2^{nR_2}\}$, called the message sets, two
encoding functions,

$$X_1 : \mathcal{W}_1 \to \mathcal{X}_1^n \tag{15.52}$$

and

$$X_2 : \mathcal{W}_2 \to \mathcal{X}_2^n, \tag{15.53}$$

and a decoding function,

$$g : \mathcal{Y}^n \to \mathcal{W}_1 \times \mathcal{W}_2. \tag{15.54}$$

There are two senders and one receiver for this channel. Sender 1
chooses an index $W_1$ uniformly from the set $\{1, 2, \ldots, 2^{nR_1}\}$ and
sends the corresponding codeword over the channel. Sender 2 does likewise.
Assuming that the distribution of messages over the product set
$\mathcal{W}_1 \times \mathcal{W}_2$ is uniform (i.e., the messages are
independent and equally likely), we define the average probability of error
for the $((2^{nR_1}, 2^{nR_2}), n)$ code as follows:

$$P_e^{(n)} = \frac{1}{2^{n(R_1 + R_2)}} \sum_{(w_1, w_2) \in \mathcal{W}_1 \times \mathcal{W}_2} \Pr\{g(Y^n) \ne (w_1, w_2) \mid (w_1, w_2) \text{ sent}\}. \tag{15.55}$$

Definition A rate pair $(R_1, R_2)$ is said to be achievable for the
multiple-access channel if there exists a sequence of
$((2^{nR_1}, 2^{nR_2}), n)$ codes with $P_e^{(n)} \to 0$.

Definition The capacity region of the multiple-access channel is the
closure of the set of achievable $(R_1, R_2)$ rate pairs.

An example of the capacity region for a multiple-access channel is
illustrated in Figure 15.8. We first state the capacity region in the form of
a theorem.

Theorem 15.3.1 (Multiple-access channel capacity) The capacity of
a multiple-access channel
$(\mathcal{X}_1 \times \mathcal{X}_2, p(y | x_1, x_2), \mathcal{Y})$ is the
closure of the convex hull of all $(R_1, R_2)$ satisfying

$$R_1 < I(X_1; Y | X_2), \tag{15.56}$$

$$R_2 < I(X_2; Y | X_1), \tag{15.57}$$

$$R_1 + R_2 < I(X_1, X_2; Y) \tag{15.58}$$

for some product distribution $p_1(x_1) p_2(x_2)$ on
$\mathcal{X}_1 \times \mathcal{X}_2$.

Before we prove that this is the capacity region of the multiple-access 
channel, let us consider a few examples of multiple-access channels: 

Example 15.3.1 (Independent binary symmetric channels) Assume
that we have two independent binary symmetric channels, one from sender
1 and the other from sender 2, as shown in Figure 15.9. In this case, it is
obvious from the results of Chapter 7 that we can send at rate $1 - H(p_1)$
over the first channel and at rate $1 - H(p_2)$ over the second channel.
Since the channels are independent, there is no interference between the
senders. The capacity region in this case is shown in Figure 15.10.

FIGURE 15.8. Capacity region for a multiple-access channel.

Example 15.3.2 (Binary multiplier channel) Consider a multiple-access
channel with binary inputs and output

$$Y = X_1 X_2. \tag{15.59}$$

Such a channel is called a binary multiplier channel. It is easy to see
that by setting $X_2 = 1$, we can send at a rate of 1 bit per transmission
from sender 1 to the receiver. Similarly, setting $X_1 = 1$, we can achieve
$R_2 = 1$. Clearly, since the output is binary, the combined rates
$R_1 + R_2$ of sender 1 and sender 2 cannot be more than 1 bit. By
timesharing, we can achieve any combination of rates such that
$R_1 + R_2 = 1$. Hence the capacity region is as shown in Figure 15.11.

Example 15.3.3 (Binary erasure multiple-access channel) This 
multiple-access channel has binary inputs, X\ = X2 = {0, 1}, and a 
ternary output, Y = X\ + X^. There is no ambiguity in (X\, X2) if F = 0 
or F = 2 is received; but Y = l can result from either (0,1) or (1,0). 
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R 2 

= 1 - H(p 2 ) _ 


0 C 1 ； =1- 

FIGURE 15.10. Capacity region for independent BSCs. 



FIGURE 15.11. Capacity region for binary multiplier channel. 

We now examine the achievable rates on the axes. Setting $X_2 = 0$, we
can send at a rate of 1 bit per transmission from sender 1. Similarly, setting
$X_1 = 0$, we can send at a rate $R_2 = 1$. This gives us two extreme points of
the capacity region. Can we do better? Let us assume that $R_1 = 1$, so that
the codewords of $X_1$ must include all possible binary sequences; $X_1$ would
look like a Bernoulli($\frac{1}{2}$) process. This acts like noise for the transmission
from $X_2$. For $X_2$, the channel looks like the channel in Figure 15.12.
This is the binary erasure channel of Chapter 7. Recalling the results, the
capacity of this channel is $\frac{1}{2}$ bit per transmission. Hence when sending at
maximum rate 1 for sender 1, we can send an additional $\frac{1}{2}$ bit from sender
2. Later, after deriving the capacity region, we can verify that these rates
are the best that can be achieved. The capacity region for a binary erasure
multiple-access channel is illustrated in Figure 15.13.

FIGURE 15.12. Equivalent single-user channel for user 2 of a binary erasure multiple-
access channel.

FIGURE 15.13. Capacity region for binary erasure multiple-access channel.
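The three mutual informations that bound the rates in these examples can be checked numerically. The sketch below is illustrative only: the helper names (`entropy`, `cond_output_entropy`, `mac_bounds`) are not from the text, and it assumes a deterministic channel $y = f(x_1, x_2)$, so that $H(Y|X_1, X_2) = 0$ and each bound reduces to an output entropy.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cond_output_entropy(p_free, p_fixed, output):
    """H(Y | fixed input) for a deterministic output(free, fixed)."""
    total = 0.0
    for x_fixed, q_fixed in p_fixed.items():
        dist = {}
        for x_free, q_free in p_free.items():
            y = output(x_free, x_fixed)
            dist[y] = dist.get(y, 0.0) + q_free
        total += q_fixed * entropy(dist.values())
    return total

def mac_bounds(p1, p2, channel):
    """Return (I(X1;Y|X2), I(X2;Y|X1), I(X1,X2;Y)) in bits.

    Assumes independent inputs p1, p2 and a deterministic channel,
    so every bound equals an (unconditional or conditional) entropy of Y.
    """
    py = {}
    for (x1, q1), (x2, q2) in product(p1.items(), p2.items()):
        y = channel(x1, x2)
        py[y] = py.get(y, 0.0) + q1 * q2
    i1 = cond_output_entropy(p1, p2, lambda a, b: channel(a, b))
    i2 = cond_output_entropy(p2, p1, lambda a, b: channel(b, a))
    return i1, i2, entropy(py.values())

uniform = {0: 0.5, 1: 0.5}
# Binary erasure MAC, Y = X1 + X2, with uniform inputs.
erasure = mac_bounds(uniform, uniform, lambda x1, x2: x1 + x2)
# Binary multiplier channel, Y = X1 * X2, with X2 held at 1.
multiplier = mac_bounds(uniform, {1: 1.0}, lambda x1, x2: x1 * x2)
```

With uniform inputs the erasure channel gives individual bounds of 1 bit each and a sum bound of 1.5 bits, consistent with the corner point $(1, \frac{1}{2})$ discussed above; fixing $X_2 = 1$ on the multiplier channel gives $R_1$ up to 1 bit and nothing for sender 2.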

15.3.1 Achievability of the Capacity Region for the
Multiple-Access Channel 

We now prove the achievability of the rate region in Theorem 15.3.1; 
the proof of the converse will be left until the next section. The proof 
of achievability is very similar to the proof for the single-user channel. 
We therefore only emphasize the points at which the proof differs from 
the single-user case. We begin by proving the achievability of rate pairs 
that satisfy (15.58) for some fixed product distribution $p(x_1)p(x_2)$. In
Section 15.3.3 we extend this to prove that all points in the convex hull 
of (15.58) are achievable. 


Proof: (Achievability in Theorem 15.3.1). Fix $p(x_1, x_2) = p_1(x_1)p_2(x_2)$.

Codebook generation: Generate $2^{nR_1}$ independent codewords $\mathbf{X}_1(i)$,
$i \in \{1, 2, \ldots, 2^{nR_1}\}$, of length $n$, generating each element i.i.d.
$\sim \prod_{i=1}^{n} p_1(x_{1i})$. Similarly, generate $2^{nR_2}$ independent codewords $\mathbf{X}_2(j)$,
$j \in \{1, 2, \ldots, 2^{nR_2}\}$, generating each element i.i.d. $\sim \prod_{i=1}^{n} p_2(x_{2i})$. These
codewords form the codebook, which is revealed to the senders and the
receiver.

Encoding: To send index $i$, sender 1 sends the codeword $\mathbf{X}_1(i)$. Similarly,
to send $j$, sender 2 sends $\mathbf{X}_2(j)$.

Decoding: Let $A_\epsilon^{(n)}$ denote the set of typical $(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y})$ sequences. The
receiver $Y^n$ chooses the pair $(i, j)$ such that

$$(\mathbf{x}_1(i), \mathbf{x}_2(j), \mathbf{y}) \in A_\epsilon^{(n)} \tag{15.60}$$

if such a pair $(i, j)$ exists and is unique; otherwise, an error is
declared.

Analysis of the probability of error: By the symmetry of the random 
code construction, the conditional probability of error does not depend on 
which pair of indices is sent. Thus, the conditional probability of error 
is the same as the unconditional probability of error. So, without loss of 
generality, we can assume that $(i, j) = (1, 1)$ was sent.

We have an error if either the correct codewords are not typical with 
the received sequence or there is a pair of incorrect codewords that are 
typical with the received sequence. Define the events 


$$E_{ij} = \{(\mathbf{X}_1(i), \mathbf{X}_2(j), \mathbf{Y}) \in A_\epsilon^{(n)}\}. \tag{15.61}$$

Then by the union of events bound,

\begin{align}
P_e^{(n)} &= P\Big(E_{11}^c \cup \bigcup_{(i,j) \neq (1,1)} E_{ij}\Big) \tag{15.62} \\
&\le P(E_{11}^c) + \sum_{i \neq 1,\, j = 1} P(E_{i1}) + \sum_{i = 1,\, j \neq 1} P(E_{1j}) + \sum_{i \neq 1,\, j \neq 1} P(E_{ij}), \tag{15.63}
\end{align}

where $P$ is the conditional probability given that $(1, 1)$ was sent. From the
AEP, $P(E_{11}^c) \to 0$. By Theorems 15.2.1 and 15.2.3, for $i \neq 1$, we have

\begin{align}
P(E_{i1}) &= P\big((\mathbf{X}_1(i), \mathbf{X}_2(1), \mathbf{Y}) \in A_\epsilon^{(n)}\big) \tag{15.64} \\
&= \sum_{(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y}) \in A_\epsilon^{(n)}} p(\mathbf{x}_1)\, p(\mathbf{x}_2, \mathbf{y}) \tag{15.65} \\
&\le |A_\epsilon^{(n)}|\, 2^{-n(H(X_1) - \epsilon)}\, 2^{-n(H(X_2, Y) - \epsilon)} \tag{15.66} \\
&\le 2^{-n(H(X_1) + H(X_2, Y) - H(X_1, X_2, Y) - 3\epsilon)} \tag{15.67} \\
&= 2^{-n(I(X_1; X_2, Y) - 3\epsilon)} \tag{15.68} \\
&= 2^{-n(I(X_1; Y|X_2) - 3\epsilon)}, \tag{15.69}
\end{align}

where the equivalence of (15.68) and (15.69) follows from the independence
of $X_1$ and $X_2$, and the consequent $I(X_1; X_2, Y) = I(X_1; X_2) +
I(X_1; Y|X_2) = I(X_1; Y|X_2)$. Similarly, for $j \neq 1$,

$$P(E_{1j}) \le 2^{-n(I(X_2; Y|X_1) - 3\epsilon)}, \tag{15.70}$$

and for $i \neq 1$, $j \neq 1$,

$$P(E_{ij}) \le 2^{-n(I(X_1, X_2; Y) - 4\epsilon)}. \tag{15.71}$$

It follows that

\begin{align}
P_e^{(n)} \le{} & P(E_{11}^c) + 2^{nR_1}\, 2^{-n(I(X_1; Y|X_2) - 3\epsilon)} + 2^{nR_2}\, 2^{-n(I(X_2; Y|X_1) - 3\epsilon)} \notag \\
& + 2^{n(R_1 + R_2)}\, 2^{-n(I(X_1, X_2; Y) - 4\epsilon)}. \tag{15.72}
\end{align}

Since $\epsilon > 0$ is arbitrary, the conditions of the theorem imply that each
term tends to 0 as $n \to \infty$. Thus, the probability of error, conditioned
on a particular codeword being sent, goes to zero if the conditions of the
theorem are met. The above bound shows that the average probability of 
error, which by symmetry is equal to the probability for an individual 
codeword, averaged over all choices of codebooks in the random code 
construction, is arbitrarily small. Hence, there exists at least one code C* 
with arbitrarily small probability of error. 

This completes the proof of achievability of the region in (15.58) for
a fixed input distribution. Later, in Section 15.3.3, we show that
time-sharing allows any $(R_1, R_2)$ in the convex hull to be achieved, completing
the proof of the forward part of the theorem. □
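To see how the three codeword-dependent terms of the bound (15.72) behave, one can evaluate them for concrete numbers. The following is a minimal sketch under assumed values: the function name `union_bound_terms` and the numbers used (mutual informations 1, 1, and 1.5 bits, and $\epsilon = 0.01$) are chosen purely for illustration.

```python
def union_bound_terms(n, r1, r2, i1, i2, i3, eps):
    """Evaluate the last three terms of (15.72).

    i1, i2, i3 stand for I(X1;Y|X2), I(X2;Y|X1), I(X1,X2;Y) in bits;
    each term has the form 2^{n(rate - mutual information + k*eps)}.
    """
    t1 = 2.0 ** (n * (r1 - i1 + 3 * eps))
    t2 = 2.0 ** (n * (r2 - i2 + 3 * eps))
    t3 = 2.0 ** (n * (r1 + r2 - i3 + 4 * eps))
    return t1, t2, t3

# Rates inside the region: every term decays exponentially in n.
inside_short = union_bound_terms(100, 0.7, 0.7, 1.0, 1.0, 1.5, 0.01)
inside_long = union_bound_terms(1000, 0.7, 0.7, 1.0, 1.0, 1.5, 0.01)

# Sum rate 1.6 > 1.5: the third term grows, so the bound is vacuous.
outside = union_bound_terms(1000, 0.8, 0.8, 1.0, 1.0, 1.5, 0.01)
```

The bound only says something useful when all three exponents are negative, which is exactly the set of rate conditions in the theorem (up to the slack in $\epsilon$).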

15.3.2 Comments on the Capacity Region for the 
Multiple-Access Channel 

We have now proved the achievability of the capacity region of the 
multiple-access channel, which is the closure of the convex hull of the set 
of points $(R_1, R_2)$ satisfying

\begin{align}
R_1 &< I(X_1; Y|X_2), \tag{15.73} \\
R_2 &< I(X_2; Y|X_1), \tag{15.74} \\
R_1 + R_2 &< I(X_1, X_2; Y) \tag{15.75}
\end{align}

for some distribution $p_1(x_1)p_2(x_2)$ on $\mathcal{X}_1 \times \mathcal{X}_2$. For a particular
$p_1(x_1)p_2(x_2)$, the region is illustrated in Figure 15.14.

FIGURE 15.14. Achievable region of multiple-access channel for a fixed input distribution.

Let us now interpret the corner points in the region. Point A corresponds
to the maximum rate achievable from sender 1 to the receiver when sender
2 is not sending any information. This is

$$\max R_1 = \max_{p_1(x_1)p_2(x_2)} I(X_1; Y|X_2). \tag{15.76}$$

Now for any distribution $p_1(x_1)p_2(x_2)$,

\begin{align}
I(X_1; Y|X_2) &= \sum_{x_2} p_2(x_2)\, I(X_1; Y|X_2 = x_2) \tag{15.77} \\
&\le \max_{x_2} I(X_1; Y|X_2 = x_2), \tag{15.78}
\end{align}

since the average is no greater than the maximum. Therefore, the maximum
in (15.76) is attained when we set $X_2 = x_2$, where $x_2$ is the value that
maximizes the conditional mutual information between $X_1$ and $Y$. The
distribution of $X_1$ is chosen to maximize this mutual information. Thus,
$X_2$ must facilitate the transmission of $X_1$ by setting $X_2 = x_2$.

The point B corresponds to the maximum rate at which sender 2 can 
send as long as sender 1 sends at his maximum rate. This is the rate that 
is obtained if $X_1$ is considered as noise for the channel from $X_2$ to $Y$. In
this case, using the results from single-user channels, $X_2$ can send at a
rate $I(X_2; Y)$. The receiver now knows which $X_2$ codeword was used and
can "subtract" its effect from the channel. We can consider the channel
now to be an indexed set of single-user channels, where the index is the
$X_2$ symbol used. The $X_1$ rate achieved in this case is the average mutual
information, where the average is over these channels, and each channel
occurs as many times as the corresponding $X_2$ symbol appears in the
codewords. Hence, the rate achieved is

$$\sum_{x_2} p(x_2)\, I(X_1; Y|X_2 = x_2) = I(X_1; Y|X_2). \tag{15.79}$$


Points C and D correspond to B and A, respectively, with the roles of the 
senders reversed. The noncorner points can be achieved by time-sharing. 
Thus, we have given a single-user interpretation and justification for the 
capacity region of a multiple-access channel. 

The idea of considering other signals as part of the noise, decoding 
one signal, and then “subtracting” it from the received signal is a very 
useful one. We will come across the same concept again in the capacity 
calculations for the degraded broadcast channel. 
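The decode-then-subtract interpretation of point B can be verified on a small example. The sketch below uses the binary erasure multiple-access channel $Y = X_1 + X_2$ with uniform independent inputs (an assumed choice of input distribution) and checks that the two single-user rates, $I(X_2; Y)$ followed by $I(X_1; Y|X_2)$, add up to $I(X_1, X_2; Y)$. The helper names are illustrative.

```python
import math
from itertools import product

def H(dist):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a pmf over tuples onto the coordinates in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[k] for k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Joint pmf of (x1, x2, y) for Y = X1 + X2 with uniform, independent inputs.
joint = {(x1, x2, x1 + x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def h(*keep):
    """Joint entropy of the selected coordinates (0: X1, 1: X2, 2: Y)."""
    return H(marginal(joint, keep))

# Stage 1: decode sender 2, treating sender 1 as noise.
rate_2 = h(1) + h(2) - h(1, 2)                   # I(X2; Y)
# Stage 2: with X2 known, decode sender 1.
rate_1 = h(0, 1) + h(1, 2) - h(1) - h(0, 1, 2)   # I(X1; Y | X2)
sum_bound = h(0, 1) + h(2) - h(0, 1, 2)          # I(X1, X2; Y)
```

For this channel the two stages give $\frac{1}{2}$ bit and 1 bit, which together meet the sum-rate bound of 1.5 bits, so the corner point lies on the dominant face of the pentagon.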

15.3.3 Convexity of the Capacity Region of the Multiple-Access
Channel 

We now recast the capacity region of the multiple-access channel in order 
to take into account the operation of taking the convex hull by introducing 
a new random variable. We begin by proving that the capacity region is 
convex. 

Theorem 15.3.2 The capacity region $\mathcal{C}$ of a multiple-access channel
is convex [i.e., if $(R_1, R_2) \in \mathcal{C}$ and $(R_1', R_2') \in \mathcal{C}$, then $(\lambda R_1 + (1 - \lambda)R_1',
\lambda R_2 + (1 - \lambda)R_2') \in \mathcal{C}$ for $0 \le \lambda \le 1$].

Proof: The idea is time-sharing. Given two sequences of codes at different
rates $\mathbf{R} = (R_1, R_2)$ and $\mathbf{R}' = (R_1', R_2')$, we can construct a third
codebook at a rate $\lambda \mathbf{R} + (1 - \lambda)\mathbf{R}'$ by using the first codebook for the
first $\lambda n$ symbols and using the second codebook for the last $(1 - \lambda)n$
symbols. The number of $X_1$ codewords in the new code is

$$2^{n\lambda R_1}\, 2^{n(1 - \lambda)R_1'} = 2^{n(\lambda R_1 + (1 - \lambda)R_1')}, \tag{15.80}$$

and hence the rate of the new code is $\lambda \mathbf{R} + (1 - \lambda)\mathbf{R}'$. Since the overall
probability of error is less than the sum of the probabilities of error for 
each of the segments, the probability of error of the new code goes to 0 
and the rate is achievable. □ 

We can now recast the statement of the capacity region for the multiple- 
access channel using a time-sharing random variable Q. Before we prove 
this result, we need to prove a property of convex sets defined by lin¬ 
ear inequalities like those of the capacity region of the multiple-access 
channel. In particular, we would like to show that the convex hull of two 
such regions defined by linear constraints is the region defined by the 
convex combination of the constraints. Initially, the equality of these two 
sets seems obvious, but on closer examination, there is a subtle difficulty 
due to the fact that some of the constraints might not be active. This is 
best illustrated by an example. Consider the following two sets defined 
by linear inequalities: 

\begin{align}
C_1 &= \{(x, y) : x \ge 0,\ y \ge 0,\ x \le 10,\ y \le 10,\ x + y \le 100\}, \tag{15.81} \\
C_2 &= \{(x, y) : x \ge 0,\ y \ge 0,\ x \le 20,\ y \le 20,\ x + y \le 20\}. \tag{15.82}
\end{align}

In this case, the $(\frac{1}{2}, \frac{1}{2})$ convex combination of the constraints defines the
region

$$C = \{(x, y) : x \ge 0,\ y \ge 0,\ x \le 15,\ y \le 15,\ x + y \le 60\}. \tag{15.83}$$

It is not difficult to see that any point in $C_1$ or $C_2$ has $x + y \le 20$, so
any point in the convex hull of the union of $C_1$ and $C_2$ satisfies this
property. Thus, the point (15,15), which is in $C$, is not in the convex hull
of $(C_1 \cup C_2)$. This example also hints at the cause of the problem: in the
definition for $C_1$, the constraint $x + y \le 100$ is not active. If this constraint
were replaced by a constraint $x + y \le a$, where $a \le 20$, the above result
of the equality of the two regions would be true, as we now prove.
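The counterexample can be checked mechanically. In this sketch each region is stored as a triple `(x_max, y_max, sum_max)`, and `in_region` and `extreme_points` are illustrative helper names. Because $x + y$ is linear, its maximum over the convex hull of $C_1 \cup C_2$ is attained at an extreme point of $C_1$ or $C_2$.

```python
def in_region(point, x_max, y_max, sum_max):
    """Membership test for {x >= 0, y >= 0, x <= x_max, y <= y_max, x + y <= sum_max}."""
    x, y = point
    return 0 <= x <= x_max and 0 <= y <= y_max and x + y <= sum_max

def extreme_points(x_max, y_max, sum_max):
    """Corner points of the region (some coincide when a constraint is inactive)."""
    xa, yb = min(x_max, sum_max), min(y_max, sum_max)
    return [(0, 0), (xa, 0), (xa, min(y_max, sum_max - xa)),
            (min(x_max, sum_max - yb), yb), (0, yb)]

c1, c2 = (10, 10, 100), (20, 20, 20)

# (1/2, 1/2) convex combination of the constraints, as in (15.83).
mixed = tuple(0.5 * a + 0.5 * b for a, b in zip(c1, c2))

# Largest value of x + y anywhere in the convex hull of C1 and C2.
largest_sum = max(x + y for c in (c1, c2) for x, y in extreme_points(*c))

point = (15, 15)
point_in_mixed_region = in_region(point, *mixed)
point_in_hull = sum(point) <= largest_sum  # necessary condition for hull membership
```

The point (15, 15) satisfies the averaged constraints but has $x + y = 30 > 20$, so it fails a condition that every point of the hull must meet.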

We restrict ourselves to the pentagonal regions that occur as compo¬ 
nents of the capacity region of a two-user multiple-access channel. In 
this case, the capacity region for a fixed $p(x_1)p(x_2)$ is defined by three
mutual informations, $I(X_1; Y|X_2)$, $I(X_2; Y|X_1)$, and $I(X_1, X_2; Y)$, which
we shall call $I_1$, $I_2$, and $I_3$, respectively. For each $p(x_1)p(x_2)$, there is a
corresponding vector, $\mathbf{I} = (I_1, I_2, I_3)$, and a rate region defined by

$$C_{\mathbf{I}} = \{(R_1, R_2) : R_1 \ge 0,\ R_2 \ge 0,\ R_1 \le I_1,\ R_2 \le I_2,\ R_1 + R_2 \le I_3\}. \tag{15.84}$$

Also, since for any distribution $p(x_1)p(x_2)$, we have $I(X_2; Y|X_1) =
H(X_2|X_1) - H(X_2|Y, X_1) = H(X_2) - H(X_2|Y, X_1) = I(X_2; Y, X_1) =
I(X_2; Y) + I(X_2; X_1|Y) \ge I(X_2; Y)$, and therefore, $I(X_1; Y|X_2) +
I(X_2; Y|X_1) \ge I(X_1; Y|X_2) + I(X_2; Y) = I(X_1, X_2; Y)$, we have for all
vectors $\mathbf{I}$ that $I_1 + I_2 \ge I_3$. This property will turn out to be critical for
the theorem.

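Lemma 15.3.1 Let $\mathbf{I}_1, \mathbf{I}_2 \in \mathbb{R}^3$ be two vectors of mutual informations
that define rate regions $C_{\mathbf{I}_1}$ and $C_{\mathbf{I}_2}$, respectively, as given in (15.84).
For $0 \le \lambda \le 1$, define $\mathbf{I}_\lambda = \lambda \mathbf{I}_1 + (1 - \lambda)\mathbf{I}_2$, and let $C_{\mathbf{I}_\lambda}$ be the rate region
defined by $\mathbf{I}_\lambda$. Then

$$C_{\mathbf{I}_\lambda} = \lambda C_{\mathbf{I}_1} + (1 - \lambda) C_{\mathbf{I}_2}. \tag{15.85}$$

Proof: We shall prove this theorem in two parts. We first show that
any point in the $(\lambda, 1 - \lambda)$ mix of the sets $C_{\mathbf{I}_1}$ and $C_{\mathbf{I}_2}$ satisfies the
constraints $\mathbf{I}_\lambda$. But this is straightforward, since any point in $C_{\mathbf{I}_1}$ satisfies the
inequalities for $\mathbf{I}_1$ and a point in $C_{\mathbf{I}_2}$ satisfies the inequalities for $\mathbf{I}_2$, so
the $(\lambda, 1 - \lambda)$ mix of these points will satisfy the $(\lambda, 1 - \lambda)$ mix of the
constraints. Thus, it follows that

$$\lambda C_{\mathbf{I}_1} + (1 - \lambda) C_{\mathbf{I}_2} \subseteq C_{\mathbf{I}_\lambda}. \tag{15.86}$$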


To prove the reverse inclusion, we consider the extreme points of the
pentagonal regions. It is not difficult to see that the rate regions defined
in (15.84) are always in the form of a pentagon, or in the extreme case
when $I_3 = I_1 + I_2$, in the form of a rectangle. Thus, the capacity region
$C_{\mathbf{I}}$ can be also defined as a convex hull of five points:

$$(0, 0),\ (I_1, 0),\ (I_1, I_3 - I_1),\ (I_3 - I_2, I_2),\ (0, I_2). \tag{15.87}$$

Consider the region defined by $\mathbf{I}_\lambda$; it, too, is defined by five points. Take
any one of the points, say $(I_3^{(\lambda)} - I_2^{(\lambda)}, I_2^{(\lambda)})$. This point can be written as
the $(\lambda, 1 - \lambda)$ mix of the points $(I_3^{(1)} - I_2^{(1)}, I_2^{(1)})$ and $(I_3^{(2)} - I_2^{(2)}, I_2^{(2)})$,
and therefore lies in the convex mixture of $C_{\mathbf{I}_1}$ and $C_{\mathbf{I}_2}$. Thus, all extreme
points of the pentagon $C_{\mathbf{I}_\lambda}$ lie in the convex hull of $C_{\mathbf{I}_1}$ and $C_{\mathbf{I}_2}$, or

$$C_{\mathbf{I}_\lambda} \subseteq \lambda C_{\mathbf{I}_1} + (1 - \lambda) C_{\mathbf{I}_2}. \tag{15.88}$$

Combining the two parts, we have the theorem. □ 

In the proof of the theorem, we have implicitly used the fact that all 
the rate regions are defined by five extreme points (at worst, some of 
the points are equal). All five points defined by the I vector were within 
the rate region. If the condition $I_3 \le I_1 + I_2$ is not satisfied, some of the
points in (15.87) may be outside the rate region and the proof collapses. 
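The extreme-point argument can be illustrated numerically. The sketch below lists the five points of (15.87) for two example vectors and checks that the corner points of the mixed vector are the same mix of the corresponding corner points. The vectors `ia`, `ib` and the weight `lam` are made-up values that satisfy $I_3 \le I_1 + I_2$; the last lines show how a corner point leaves the region when that condition fails.

```python
def pentagon(i1, i2, i3):
    """The five points of (15.87) for the vector I = (i1, i2, i3)."""
    return [(0.0, 0.0), (i1, 0.0), (i1, i3 - i1), (i3 - i2, i2), (0.0, i2)]

def mix(u, v, lam):
    """Componentwise (lam, 1 - lam) convex combination of two tuples."""
    return tuple(lam * a + (1 - lam) * b for a, b in zip(u, v))

ia = (1.0, 1.0, 1.5)   # assumed example vector, I3 <= I1 + I2 holds
ib = (0.5, 0.8, 1.0)   # assumed example vector, I3 <= I1 + I2 holds
lam = 0.25

i_lam = mix(ia, ib, lam)
corners_of_mixed_vector = pentagon(*i_lam)
mixed_corners = [mix(p, q, lam) for p, q in zip(pentagon(*ia), pentagon(*ib))]

# When I3 > I1 + I2 (the inactive-constraint case of (15.81)), the point
# (I1, I3 - I1) violates R2 <= I2 and is no longer in the region.
bad = pentagon(10.0, 10.0, 100.0)[2]
bad_point_in_region = bad[1] <= 10.0
```

Because every corner point is an affine function of $(I_1, I_2, I_3)$, mixing the vectors and mixing the corner points give the same five points, which is the content of (15.88) whenever those points really are the extreme points of the region.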

As an immediate consequence of the above lemma, we have the
following theorem:

Theorem 15.3.3 The convex hull of the union of the rate regions defined 
by individual I vectors is equal to the rate region defined by the convex 
hull of the I vectors. 

These arguments on the equivalence of the convex hull operation on 
the rate regions with the convex combinations of the mutual informations
can be extended to the general m-user multiple-access channel. A
proof along these lines using the theory of polymatroids is developed in 
Han [271]. 

Theorem 15.3.4 The set of achievable rates of a discrete memoryless 
multiple-access channel is given by the closure of the set of all $(R_1, R_2)$
pairs satisfying

\begin{align}
R_1 &\le I(X_1; Y|X_2, Q), \notag \\
R_2 &\le I(X_2; Y|X_1, Q), \notag \\
R_1 + R_2 &\le I(X_1, X_2; Y|Q) \tag{15.89}
\end{align}

for some choice of the joint distribution $p(q)p(x_1|q)p(x_2|q)p(y|x_1, x_2)$
with $|\mathcal{Q}| \le 4$.

Proof: We will show that every rate pair lying in the region defined in 
(15.89) is achievable (i.e., it lies in the convex closure of the rate pairs 
satisfying Theorem 15.3.1). We also show that every point in the convex 
closure of the region in Theorem 15.3.1 is also in the region defined 
in (15.89). 

Consider a rate point $\mathbf{R}$ satisfying the inequalities (15.89) of the
theorem. We can rewrite the right-hand side of the first inequality as

\begin{align}
I(X_1; Y|X_2, Q) &= \sum_{q=1}^{m} p(q)\, I(X_1; Y|X_2, Q = q) \tag{15.90} \\
&= \sum_{q=1}^{m} p(q)\, I(X_1; Y|X_2)_{p_{1q}, p_{2q}}, \tag{15.91}
\end{align}

where $m$ is the cardinality of the support set of $Q$. We can expand the
other mutual informations similarly.

For simplicity in notation, we consider a rate pair as a vector and 
denote a pair satisfying the inequalities in (15.58) for a specific input
product distribution $p_{1q}(x_1)p_{2q}(x_2)$ as $\mathbf{R}_q$. Specifically, let $\mathbf{R}_q =
(R_{1q}, R_{2q})$ be a rate pair satisfying

\begin{align}
R_{1q} &\le I(X_1; Y|X_2)_{p_{1q}(x_1)p_{2q}(x_2)}, \tag{15.92} \\
R_{2q} &\le I(X_2; Y|X_1)_{p_{1q}(x_1)p_{2q}(x_2)}, \tag{15.93} \\
R_{1q} + R_{2q} &\le I(X_1, X_2; Y)_{p_{1q}(x_1)p_{2q}(x_2)}. \tag{15.94}
\end{align}

Then by Theorem 15.3.1, $\mathbf{R}_q = (R_{1q}, R_{2q})$ is achievable. Then since $\mathbf{R}$
satisfies (15.89) and we can expand the right-hand sides as in (15.91),
there exists a set of pairs $\mathbf{R}_q$ satisfying (15.94) such that

$$\mathbf{R} = \sum_{q=1}^{m} p(q)\, \mathbf{R}_q. \tag{15.95}$$

Since a convex combination of achievable rates is achievable, so is R. 
Hence, we have proven the achievability of the region in the theorem. 
The same argument can be used to show that every point in the convex
closure of the region in (15.58) can be written as the mixture of points
satisfying (15.94) and hence can be written in the form (15.89).

The converse is proved in the next section. The converse shows that
all achievable rate pairs are of the form (15.89), and hence establishes
that this is the capacity region of the multiple-access channel. The
cardinality bound on the time-sharing random variable $Q$ is a consequence of
Carathéodory's theorem on convex sets. See the discussion below. □

The proof of the convexity of the capacity region shows that any convex 
combination of achievable rate pairs is also achievable. We can continue 
this process, taking convex combinations of more points. Do we need to 
use an arbitrary number of points? Will the capacity region be increased?
The following theorem says no. 

Theorem 15.3.5 (Carathéodory) Any point in the convex closure of a
compact set $A$ in a $d$-dimensional Euclidean space can be represented as
a convex combination of $d + 1$ or fewer points in the original set $A$.

Proof: The proof may be found in Eggleston [183] and Grünbaum
[263]. □


This theorem allows us to restrict attention to a certain finite convex 
combination when calculating the capacity region. This is an important 
property because without it, we would not be able to compute the capacity 
region in (15.89)，since we would never know whether using a larger 
alphabet Q would increase the region. 

In the multiple-access channel, the bounds define a connected compact 
set in three dimensions. Therefore, all points in its closure can be defined 
as the convex combination of at most four points. Hence, we can restrict 
the cardinality of Q to at most 4 in the above definition of the capacity 
region. 


Remark Many of the cardinality bounds may be slightly improved by 
introducing other considerations. For example, if we are only interested 
in the boundary of the convex hull of A as we are in capacity theorems, 
a point on the boundary can be expressed as a mixture of d points of 
A, since a point on the boundary lies in the intersection of A with a 
$(d - 1)$-dimensional support hyperplane.


15.3.4 Converse for the Multiple-Access Channel 

We have so far proved the achievability of the capacity region. In this
section we prove the converse.

Proof: (Converse to Theorems 15.3.1 and 15.3.4). We must show that
given any sequence of $((2^{nR_1}, 2^{nR_2}), n)$ codes with $P_e^{(n)} \to 0$, the rates
must satisfy

\begin{align}
R_1 &\le I(X_1; Y|X_2, Q), \notag \\
R_2 &\le I(X_2; Y|X_1, Q), \notag \\
R_1 + R_2 &\le I(X_1, X_2; Y|Q) \tag{15.96}
\end{align}

for some choice of random variable $Q$ defined on $\{1, 2, 3, 4\}$ and joint
distribution $p(q)p(x_1|q)p(x_2|q)p(y|x_1, x_2)$. Fix $n$. Consider the given
code of block length $n$. The joint distribution on $\mathcal{W}_1 \times \mathcal{W}_2 \times \mathcal{X}_1^n \times \mathcal{X}_2^n \times
\mathcal{Y}^n$ is well defined. The only randomness is due to the random uniform
choice of indices $W_1$ and $W_2$ and the randomness induced by the channel.
The joint distribution is

$$p(w_1, w_2, x_1^n, x_2^n, y^n) = \frac{1}{2^{nR_1}} \frac{1}{2^{nR_2}}\, p(x_1^n|w_1)\, p(x_2^n|w_2) \prod_{i=1}^{n} p(y_i|x_{1i}, x_{2i}), \tag{15.97}$$

where $p(x_1^n|w_1)$ is either 1 or 0, depending on whether $x_1^n = \mathbf{x}_1(w_1)$, the
codeword corresponding to $w_1$, or not, and similarly, $p(x_2^n|w_2) = 1$ or 0,
according to whether $x_2^n = \mathbf{x}_2(w_2)$ or not. The mutual informations that
follow are calculated with respect to this distribution.

By the code construction, it is possible to estimate $(W_1, W_2)$ from the
received sequence $Y^n$ with a low probability of error. Hence, the
conditional entropy of $(W_1, W_2)$ given $Y^n$ must be small. By Fano's inequality,

$$H(W_1, W_2|Y^n) \le n(R_1 + R_2) P_e^{(n)} + H(P_e^{(n)}) = n\epsilon_n. \tag{15.98}$$

It is clear that $\epsilon_n \to 0$ as $P_e^{(n)} \to 0$. Then we have

\begin{align}
H(W_1|Y^n) &\le H(W_1, W_2|Y^n) \le n\epsilon_n, \tag{15.99} \\
H(W_2|Y^n) &\le H(W_1, W_2|Y^n) \le n\epsilon_n. \tag{15.100}
\end{align}


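We can now bound the rate $R_1$ as

\begin{align}
nR_1 &= H(W_1) \tag{15.101} \\
&= I(W_1; Y^n) + H(W_1|Y^n) \tag{15.102} \\
&\stackrel{(a)}{\le} I(W_1; Y^n) + n\epsilon_n \tag{15.103} \\
&\stackrel{(b)}{\le} I(X_1^n(W_1); Y^n) + n\epsilon_n \tag{15.104} \\
&= H(X_1^n(W_1)) - H(X_1^n(W_1)|Y^n) + n\epsilon_n \tag{15.105} \\
&\stackrel{(c)}{\le} H(X_1^n(W_1)|X_2^n(W_2)) - H(X_1^n(W_1)|Y^n, X_2^n(W_2)) + n\epsilon_n \tag{15.106} \\
&= I(X_1^n(W_1); Y^n|X_2^n(W_2)) + n\epsilon_n \tag{15.107} \\
&= H(Y^n|X_2^n(W_2)) - H(Y^n|X_1^n(W_1), X_2^n(W_2)) + n\epsilon_n \tag{15.108} \\
&\stackrel{(d)}{=} H(Y^n|X_2^n(W_2)) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X_1^n(W_1), X_2^n(W_2)) + n\epsilon_n \tag{15.109} \\
&\stackrel{(e)}{=} H(Y^n|X_2^n(W_2)) - \sum_{i=1}^{n} H(Y_i|X_{1i}, X_{2i}) + n\epsilon_n \tag{15.110} \\
&\stackrel{(f)}{\le} \sum_{i=1}^{n} H(Y_i|X_2^n(W_2)) - \sum_{i=1}^{n} H(Y_i|X_{1i}, X_{2i}) + n\epsilon_n \tag{15.111} \\
&\stackrel{(g)}{\le} \sum_{i=1}^{n} H(Y_i|X_{2i}) - \sum_{i=1}^{n} H(Y_i|X_{1i}, X_{2i}) + n\epsilon_n \tag{15.112} \\
&= \sum_{i=1}^{n} I(X_{1i}; Y_i|X_{2i}) + n\epsilon_n, \tag{15.113}
\end{align}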
where 

(a) follows from Fano’s inequality 

(b) follows from the data-processing inequality 

(c) follows from the fact that since $W_1$ and $W_2$ are independent,
so are $X_1^n(W_1)$ and $X_2^n(W_2)$, and hence $H(X_1^n(W_1)|X_2^n(W_2)) =
H(X_1^n(W_1))$, and $H(X_1^n(W_1)|Y^n, X_2^n(W_2)) \le H(X_1^n(W_1)|Y^n)$ by
conditioning

(d) follows from the chain rule

(e) follows from the fact that $Y_i$ depends only on $X_{1i}$ and $X_{2i}$ by the
memoryless property of the channel

(f) follows from the chain rule and removing conditioning 

(g) follows from removing conditioning 

Hence, we have 

$$R_1 \le \frac{1}{n} \sum_{i=1}^{n} I(X_{1i}; Y_i|X_{2i}) + \epsilon_n. \tag{15.114}$$

Similarly, we have

$$R_2 \le \frac{1}{n} \sum_{i=1}^{n} I(X_{2i}; Y_i|X_{1i}) + \epsilon_n. \tag{15.115}$$

To bound the sum of the rates, we have 

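\begin{align}
n(R_1 + R_2) &= H(W_1, W_2) \tag{15.116} \\
&= I(W_1, W_2; Y^n) + H(W_1, W_2|Y^n) \tag{15.117} \\
&\stackrel{(a)}{\le} I(W_1, W_2; Y^n) + n\epsilon_n \tag{15.118} \\
&\stackrel{(b)}{\le} I(X_1^n(W_1), X_2^n(W_2); Y^n) + n\epsilon_n \tag{15.119} \\
&= H(Y^n) - H(Y^n|X_1^n(W_1), X_2^n(W_2)) + n\epsilon_n \tag{15.120} \\
&\stackrel{(c)}{=} H(Y^n) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X_1^n(W_1), X_2^n(W_2)) + n\epsilon_n \tag{15.121} \\
&\stackrel{(d)}{=} H(Y^n) - \sum_{i=1}^{n} H(Y_i|X_{1i}, X_{2i}) + n\epsilon_n \tag{15.122} \\
&\stackrel{(e)}{\le} \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i|X_{1i}, X_{2i}) + n\epsilon_n \tag{15.123} \\
&= \sum_{i=1}^{n} I(X_{1i}, X_{2i}; Y_i) + n\epsilon_n, \tag{15.124}
\end{align}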

where 

(a) follows from Fano’s inequality 

(b) follows from the data-processing inequality 

(c) follows from the chain rule 

(d) follows from the fact that $Y_i$ depends only on $X_{1i}$ and $X_{2i}$ and is
conditionally independent of everything else

(e) follows from the chain rule and removing conditioning 

Hence, we have 


$$R_1 + R_2 \le \frac{1}{n} \sum_{i=1}^{n} I(X_{1i}, X_{2i}; Y_i) + \epsilon_n. \tag{15.125}$$

The expressions in (15.114), (15.115), and (15.125) are the averages of the
mutual informations calculated at the empirical distributions in column $i$
of the codebook. We can rewrite these equations with the new variable $Q$,
where $Q = i \in \{1, 2, \ldots, n\}$ with probability $\frac{1}{n}$. The equations become


\begin{align}
R_1 &\le \frac{1}{n} \sum_{i=1}^{n} I(X_{1i}; Y_i|X_{2i}) + \epsilon_n \tag{15.126} \\
&= \frac{1}{n} \sum_{i=1}^{n} I(X_{1q}; Y_q|X_{2q}, Q = i) + \epsilon_n \tag{15.127} \\
&= I(X_{1Q}; Y_Q|X_{2Q}, Q) + \epsilon_n \tag{15.128} \\
&= I(X_1; Y|X_2, Q) + \epsilon_n, \tag{15.129}
\end{align}

where $X_1 = X_{1Q}$, $X_2 = X_{2Q}$, and $Y = Y_Q$ are new random variables
whose distributions depend on $Q$ in the same way as the distributions
of $X_{1i}$, $X_{2i}$, and $Y_i$ depend on $i$. Since $W_1$ and $W_2$ are independent, so are
$X_{1i}(W_1)$ and $X_{2i}(W_2)$, and hence

\begin{align}
&\Pr(X_{1i}(W_1) = x_1,\ X_{2i}(W_2) = x_2) \notag \\
&\qquad = \Pr\{X_{1Q} = x_1|Q = i\}\, \Pr\{X_{2Q} = x_2|Q = i\}. \tag{15.130}
\end{align}

Hence, taking the limit as $n \to \infty$, $P_e^{(n)} \to 0$, we have the following
converse:

\begin{align}
R_1 &\le I(X_1; Y|X_2, Q), \notag \\
R_2 &\le I(X_2; Y|X_1, Q), \notag \\
R_1 + R_2 &\le I(X_1, X_2; Y|Q) \tag{15.131}
\end{align}

for some choice of joint distribution $p(q)p(x_1|q)p(x_2|q)p(y|x_1, x_2)$. As
in Section 15.3.3, the region is unchanged if we limit the cardinality of
$\mathcal{Q}$ to 4.

This completes the proof of the converse. □ 

Thus, the achievability of the region of Theorem 15.3.1 was proved in 
Section 15.3.1. In Section 15.3.3 we showed that every point in the region 
defined by (15.96) was also achievable. In the converse, we showed that 
the region in (15.96) was the best we can do, establishing that this is 
indeed the capacity region of the channel. Thus, the region in (15.58)
cannot be any larger than the region in (15.96), and this is the capacity
region of the multiple-access channel.


15.3.5 m-User Multiple-Access Channels 

We will now generalize the result derived for two senders to $m$ senders,
$m > 2$. The multiple-access channel in this case is shown in Figure 15.15.

We send independent indices $w_1, w_2, \ldots, w_m$ over the channel from
the senders $1, 2, \ldots, m$, respectively. The codes, rates, and achievability
are all defined in exactly the same way as in the two-sender case.

Let $S \subseteq \{1, 2, \ldots, m\}$. Let $S^c$ denote the complement of $S$. Let
$R(S) = \sum_{i \in S} R_i$, and let $X(S) = \{X_i : i \in S\}$. Then we have the
following theorem.

Theorem 15.3.6 The capacity region of the m-user multiple-access
channel is the closure of the convex hull of the rate vectors satisfying

$$R(S) \le I(X(S); Y|X(S^c)) \quad \text{for all } S \subseteq \{1, 2, \ldots, m\} \tag{15.132}$$

for some product distribution $p_1(x_1)p_2(x_2) \cdots p_m(x_m)$.

Proof: The proof contains no new ideas. There are now $2^m - 1$ terms in
the probability of error in the achievability proof and an equal number of 
inequalities in the proof of the converse. Details are left to the reader. □ 

In general, the region in (15.132) is a beveled box. 


FIGURE 15.15. m-user multiple-access channel.

15.3.6 Gaussian Multiple-Access Channels

We now discuss the Gaussian multiple-access channel of Section 15.1.2 
in somewhat more detail. 

Two senders, $X_1$ and $X_2$, communicate to the single receiver, $Y$. The
received signal at time $i$ is

$$Y_i = X_{1i} + X_{2i} + Z_i, \tag{15.133}$$

where $\{Z_i\}$ is a sequence of independent, identically distributed,
zero-mean Gaussian random variables with variance $N$ (Figure 15.16). We
assume that there is a power constraint $P_j$ on sender $j$; that is, for each
sender, for all messages, we must have

$$\frac{1}{n} \sum_{i=1}^{n} x_{ji}^2(w_j) \le P_j, \qquad w_j \in \{1, 2, \ldots, 2^{nR_j}\}, \quad j = 1, 2. \tag{15.134}$$

Just as the proof of achievability of channel capacity for the discrete 
case (Chapter 7) was extended to the Gaussian channel (Chapter 9), we 
can extend the proof for the discrete multiple-access channel to the
Gaussian multiple-access channel. The converse can also be extended similarly,
so we expect the capacity region to be the convex hull of the set of rate
pairs satisfying

\begin{align}
R_1 &\le I(X_1; Y|X_2), \tag{15.135} \\
R_2 &\le I(X_2; Y|X_1), \tag{15.136} \\
R_1 + R_2 &\le I(X_1, X_2; Y) \tag{15.137}
\end{align}

for some input distribution $f_1(x_1)f_2(x_2)$ satisfying $EX_1^2 \le P_1$ and
$EX_2^2 \le P_2$.


FIGURE 15.16. Gaussian multiple-access channel.


Now, we can expand the mutual information in terms of differential
entropy, and thus

\begin{align}
I(X_1; Y|X_2) &= h(Y|X_2) - h(Y|X_1, X_2) \tag{15.138} \\
&= h(X_1 + X_2 + Z|X_2) - h(X_1 + X_2 + Z|X_1, X_2) \tag{15.139} \\
&= h(X_1 + Z|X_2) - h(Z|X_1, X_2) \tag{15.140} \\
&= h(X_1 + Z|X_2) - h(Z) \tag{15.141} \\
&= h(X_1 + Z) - h(Z) \tag{15.142} \\
&= h(X_1 + Z) - \frac{1}{2} \log(2\pi e) N \tag{15.143} \\
&\le \frac{1}{2} \log(2\pi e)(P_1 + N) - \frac{1}{2} \log(2\pi e) N \tag{15.144} \\
&= \frac{1}{2} \log\left(1 + \frac{P_1}{N}\right), \tag{15.145}
\end{align}

where (15.141) follows from the fact that $Z$ is independent of $X_1$ and
$X_2$, (15.142) from the independence of $X_1$ and $X_2$, and (15.144) from
the fact that the normal maximizes entropy for a given second moment.
Thus, the maximizing distribution is $X_1 \sim \mathcal{N}(0, P_1)$ and $X_2 \sim \mathcal{N}(0, P_2)$
with $X_1$ and $X_2$ independent. This distribution simultaneously maximizes
the mutual information bounds in (15.135)-(15.137).


Definition We define the channel capacity function 

$$C(x) = \frac{1}{2} \log(1 + x), \tag{15.146}$$

corresponding to the channel capacity of a Gaussian white-noise channel
with signal-to-noise ratio $x$ (Figure 15.17). Then we write the bound on
$R_1$ as

$$R_1 \le C\left(\frac{P_1}{N}\right). \tag{15.147}$$

Similarly,

$$R_2 \le C\left(\frac{P_2}{N}\right), \tag{15.148}$$

and

$$R_1 + R_2 \le C\left(\frac{P_1 + P_2}{N}\right). \tag{15.149}$$

FIGURE 15.17. Gaussian multiple-access channel capacity.

These upper bounds are achieved when $X_1 \sim \mathcal{N}(0, P_1)$ and $X_2 \sim
\mathcal{N}(0, P_2)$ and define the capacity region. The surprising fact about these
inequalities is that the sum of the rates can be as large as $C\left(\frac{P_1 + P_2}{N}\right)$,
which is the rate achieved by a single transmitter sending with a power
equal to the sum of the powers.

The interpretation of the corner points is very similar to the
interpretation of the achievable rate pairs for a discrete multiple-access channel
for a fixed input distribution. In the case of the Gaussian channel, we can
consider decoding as a two-stage process: In the first stage, the receiver
decodes the second sender, considering the first sender as part of the noise.
This decoding will have low probability of error if $R_2 < C\left(\frac{P_2}{P_1 + N}\right)$. After
the second sender has been decoded successfully, it can be subtracted out
and the first sender can be decoded correctly if $R_1 < C\left(\frac{P_1}{N}\right)$. Hence, this
argument shows that we can achieve the rate pairs at the corner points
of the capacity region by means of single-user operations. This process,
called onion-peeling, can be extended to any number of users.
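The two-stage decoding argument can be checked with a few lines of arithmetic. The sketch below assumes logarithms to base 2 (rates in bits per transmission); the function names and the example powers $P_1 = 10$, $P_2 = 5$, $N = 1$ are illustrative choices, not values from the text.

```python
import math

def capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1 + snr)

def onion_peeling_corner(p1, p2, noise):
    """Rates when sender 2 is decoded first (sender 1 treated as noise),
    then subtracted so that sender 1 sees only the Gaussian noise."""
    r2 = capacity(p2 / (p1 + noise))
    r1 = capacity(p1 / noise)
    return r1, r2

p1, p2, noise = 10.0, 5.0, 1.0
r1, r2 = onion_peeling_corner(p1, p2, noise)
sum_capacity = capacity((p1 + p2) / noise)
```

Here $r_1 + r_2 = \frac{1}{2}\log_2 11 + \frac{1}{2}\log_2\frac{16}{11} = \frac{1}{2}\log_2 16 = 2$ bits, which equals $C((P_1 + P_2)/N)$, so the corner point reached by single-user decoding lies on the sum-rate boundary.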

If we generalize this to $m$ senders with equal power, the total rate
is $C\left(\frac{mP}{N}\right)$, which goes to $\infty$ as $m \to \infty$. The average rate per sender,
$\frac{1}{m} C\left(\frac{mP}{N}\right)$, goes to 0. Thus, when the total number of senders is very large,
so that there is a lot of interference, we can still send a total amount of
information that is arbitrarily large even though the rate per individual
sender goes to 0.
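A quick numerical look at this scaling is given below. It is a sketch that assumes base-2 logarithms and unit power and noise ($P = N = 1$); the function name `total_and_per_sender` is illustrative.

```python
import math

def capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1 + snr)

def total_and_per_sender(m, power, noise):
    """Total rate C(mP/N) and the average rate per sender for m equal-power senders."""
    total = capacity(m * power / noise)
    return total, total / m

table = [(m, *total_and_per_sender(m, 1.0, 1.0)) for m in (1, 10, 100, 1000, 10000)]
totals = [row[1] for row in table]
per_sender = [row[2] for row in table]
```

The total rate grows only logarithmically in $m$, while the per-sender share shrinks roughly like $\frac{\log m}{2m}$.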

The capacity region described above corresponds to code-division
multiple access (CDMA), where separate codes are used for the different
senders and the receiver decodes them one by one. In many practical
situations, though, simpler schemes, such as frequency-division multiplexing
or time-division multiplexing, are used. With frequency-division
multiplexing, the rates depend on the bandwidth allotted to each sender. Consider
the case of two senders with powers $P_1$ and $P_2$ using nonintersecting
frequency bands with bandwidths $W_1$ and $W_2$, where $W_1 + W_2 = W$ (the
total bandwidth). Using the formula for the capacity of a single-user
band-limited channel, the following rate pair is achievable:

\begin{align}
R_1 &= W_1 \log\left(1 + \frac{P_1}{N W_1}\right), \tag{15.150} \\
R_2 &= W_2 \log\left(1 + \frac{P_2}{N W_2}\right). \tag{15.151}
\end{align}

As we vary W\ and W 2 , we trace out the curve as shown in Figure 15.18. 
This curve touches the boundary of the capacity region at one point, 
which corresponds to allotting bandwidth to each channel proportional to 
the power in that channel. We conclude that no allocation of frequency 
bands to radio stations can be optimal unless the allocated powers are 
proportional to the bandwidths. 
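The claim that the FDMA curve touches the boundary only at the proportional allocation can be illustrated numerically. The sketch below evaluates (15.150) and (15.151) with base-2 logarithms; here `n0` plays the role of $N$ as a noise power per unit bandwidth, and the values $W = 1$, $P_1 = 10$, $P_2 = 5$, $N = 1$ are assumed for illustration.

```python
import math

def fdma_rates(w1, w_total, p1, p2, n0):
    """Rate pair of (15.150)-(15.151) for a bandwidth split (w1, w_total - w1).

    Requires 0 < w1 < w_total so that both bands are nonempty.
    """
    w2 = w_total - w1
    r1 = w1 * math.log2(1 + p1 / (n0 * w1))
    r2 = w2 * math.log2(1 + p2 / (n0 * w2))
    return r1, r2

def sum_rate_bound(w_total, p1, p2, n0):
    """Rate of a single sender using the full band with power p1 + p2."""
    return w_total * math.log2(1 + (p1 + p2) / (n0 * w_total))

w_total, p1, p2, n0 = 1.0, 10.0, 5.0, 1.0
bound = sum_rate_bound(w_total, p1, p2, n0)

# Bandwidth proportional to power: W1 / W = P1 / (P1 + P2).
proportional = fdma_rates(w_total * p1 / (p1 + p2), w_total, p1, p2, n0)
# Equal split, which is not proportional to the powers here.
equal_split = fdma_rates(0.5 * w_total, w_total, p1, p2, n0)
```

With the proportional split both bands see the same signal-to-noise ratio $(P_1 + P_2)/(N W)$, so the rates add up to the bound; any other split falls strictly short of it.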

In time-division multiple access (TDMA), time is divided into slots, 
and each user is allotted a slot during which only that user will transmit 
and every other user remains quiet. If there are two users, each of power 
P, the rate that each sends when the other is silent is C(P/N). Now if 
time is divided into equal-length slots, and every odd slot is allocated 
to user 1 and every even slot to user 2, the average rate that each user 
achieves is $\frac{1}{2} C(P/N)$. This system is called naive time-division multiple
access (TDMA). However, it is possible to do better if we notice that since
user 1 is sending only half the time, it is possible for him to use twice
the power during his transmissions and still maintain the same average
power constraint. With this modification, it is possible for each user to
send information at a rate $\frac{1}{2} C(2P/N)$. By varying the lengths of the
slots allotted to each sender (and the instantaneous power used during the 
slot), we can achieve the same capacity region as FDMA with different 
bandwidth allocations. 
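The gain from boosting power during a user's own slot is easy to quantify. In the sketch below, `alpha` is the fraction of time given to user 1 (an illustrative parameter name), each user scales its instantaneous power so that the average power constraint is still met, and logarithms are taken to base 2.

```python
import math

def capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * math.log2(1 + snr)

def naive_tdma_rate(power, noise):
    """Each user transmits half the time at its average power P."""
    return 0.5 * capacity(power / noise)

def tdma_rates(alpha, p1, p2, noise):
    """User 1 transmits a fraction alpha of the time with power p1 / alpha;
    user 2 uses the rest with power p2 / (1 - alpha). Requires 0 < alpha < 1."""
    r1 = alpha * capacity(p1 / (alpha * noise))
    r2 = (1 - alpha) * capacity(p2 / ((1 - alpha) * noise))
    return r1, r2

power, noise = 10.0, 1.0
naive = naive_tdma_rate(power, noise)
boosted = tdma_rates(0.5, power, power, noise)
sum_capacity = capacity(2 * power / noise)
```

For equal powers and equal slots, each user gets $\frac{1}{2} C(2P/N)$, which is larger than the naive $\frac{1}{2} C(P/N)$, and the two rates together reach the sum-rate bound $C(2P/N)$.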

FIGURE 15.18. Gaussian multiple-access channel capacity with FDMA and TDMA.

As Figure 15.18 illustrates, in general the capacity region is larger than
that achieved by time- or frequency-division multiplexing. But note that
the multiple-access capacity region derived above is achieved by use of
a common decoder for all the senders. However, it is also possible to 
achieve the capacity region by onion-peeling, which removes the need 
for a common decoder and instead, uses a sequence of single-user codes. 
CDMA achieves the entire capacity region, and in addition, allows new 
users to be added easily without changing the codes of the current users. 
On the other hand, TDMA and FDMA systems are usually designed for 
a fixed number of users and it is possible that either some slots are empty 
(if the actual number of users is less than the number of slots) or some 
users are left out (if the number of users is greater than the number 
of slots). However, in many practical systems, simplicity of design is 
an important consideration, and the improvement in capacity due to the 
multiple-access ideas presented earlier may not be sufficient to warrant 
the increased complexity. 

For a Gaussian multiple-access system with m sources with powers 
$P_1, P_2, \ldots, P_m$ and ambient noise of power $N$, we can state the equivalent 
of Gauss's law for any set $S$ in the form 

$$\sum_{i \in S} R_i = \text{total rate of information flow from } S \tag{15.152}$$

$$\leq C\left(\frac{\sum_{i \in S} P_i}{N}\right). \tag{15.153}$$

15.4 ENCODING OF CORRELATED SOURCES 

We now turn to distributed data compression. This problem is in many 
ways the data compression dual to the multiple-access channel problem. 
We know how to encode a source X. A rate $R > H(X)$ is sufficient. Now 
suppose that there are two sources $(X, Y) \sim p(x, y)$. A rate $H(X, Y)$ 
is sufficient if we are encoding them together. But what if the X and 
Y sources must be described separately for some user who wishes to 
reconstruct both X and Y? Clearly, by separately encoding X and Y, it is 
seen that a rate $R = R_x + R_y > H(X) + H(Y)$ is sufficient. However, in 
a surprising and fundamental paper by Slepian and Wolf [502], it is shown 
that a total rate $R = H(X, Y)$ is sufficient even for separate encoding of 
correlated sources. 

Let $(X_1, Y_1), (X_2, Y_2), \ldots$ be a sequence of jointly distributed random 
variables i.i.d. $\sim p(x, y)$. Assume that the X sequence is available at a 
location A and the Y sequence is available at a location B. The situation 
is illustrated in Figure 15.19. 

Before we proceed to the proof of this result, we will give a few 
definitions. 

Definition A $((2^{nR_1}, 2^{nR_2}), n)$ distributed source code for the joint 
source (X, Y) consists of two encoder maps, 

$$f_1 : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR_1}\}, \tag{15.154}$$

$$f_2 : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR_2}\}, \tag{15.155}$$

and a decoder map, 

$$g : \{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\} \to \mathcal{X}^n \times \mathcal{Y}^n. \tag{15.156}$$

Here $f_1(X^n)$ is the index corresponding to $X^n$, $f_2(Y^n)$ is the index 
corresponding to $Y^n$, and $(R_1, R_2)$ is the rate pair of the code. 

FIGURE 15.19. Slepian-Wolf coding. 

Definition The probability of error for a distributed source code is 
defined as 

$$P_e^{(n)} = P\big(g(f_1(X^n), f_2(Y^n)) \neq (X^n, Y^n)\big). \tag{15.157}$$

Definition A rate pair $(R_1, R_2)$ is said to be achievable for a distributed 
source if there exists a sequence of $((2^{nR_1}, 2^{nR_2}), n)$ distributed source 
codes with probability of error $P_e^{(n)} \to 0$. The achievable rate region is 
the closure of the set of achievable rates. 

Theorem 15.4.1 (Slepian-Wolf) For the distributed source coding 
problem for the source (X, Y) drawn i.i.d. $\sim p(x, y)$, the achievable rate 
region is given by 

$$R_1 \geq H(X|Y), \tag{15.158}$$

$$R_2 \geq H(Y|X), \tag{15.159}$$

$$R_1 + R_2 \geq H(X, Y). \tag{15.160}$$

Let us illustrate the result with some examples. 

Example 15.4.1 Consider the weather in Gotham and Metropolis. For 
the purposes of our example, we assume that Gotham is sunny with 
probability 0.5 and that the weather in Metropolis is the same as in Gotham 
with probability 0.89. The joint distribution of the weather is given as 
follows: 

                          Metropolis
    p(x, y)             Rain      Shine
    Gotham    Rain      0.445     0.055
              Shine     0.055     0.445

Assume that we wish to transmit 100 days of weather information to the 
National Weather Service headquarters in Washington. We could send all 
the 100 bits of the weather in both places, making 200 bits in all. If we 
decided to compress the information independently, we would still need 
100H(0.5) = 100 bits of information from each place, for a total of 200 
bits. If, instead, we use Slepian-Wolf encoding, we need only 100H(X) + 
100H(Y|X) = 100H(0.5) + 100H(0.89) = 100 + 50 = 150 bits total. 
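
The bit counts in this example can be reproduced from the joint distribution. The following is a small sketch, assuming entropies are measured in bits and that the joint table is stored with Gotham indexing the rows and Metropolis the columns; the variable and function names are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y): rows are Gotham (rain, shine), columns are Metropolis.
joint = [[0.445, 0.055],
         [0.055, 0.445]]

h_xy = entropy([p for row in joint for p in row])      # H(X, Y)
h_x = entropy([sum(row) for row in joint])             # H(X), Gotham marginal
h_y = entropy([sum(col) for col in zip(*joint)])       # H(Y), Metropolis marginal
h_y_given_x = h_xy - h_x                               # H(Y|X) by the chain rule
h_x_given_y = h_xy - h_y                               # H(X|Y)

days = 100
separate_bits = days * (h_x + h_y)     # independent compression of each city
slepian_wolf_bits = days * h_xy        # total on the sum-rate boundary
```

Here `h_y_given_x` equals the binary entropy H(0.89), which is just under 0.5 bit, so `slepian_wolf_bits` comes out just under the 150 bits quoted in the text.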

Example 15.4.2 Consider the following joint distribution: 


    p(u, v)      0       1
       0        1/3     1/3
       1         0      1/3

In this case, the total rate required for the transmission of this 
source is $H(U) + H(V|U) = \log 3 = 1.58$ bits rather than the 2 bits that 
would be needed if the sources were transmitted independently without 
Slepian-Wolf encoding. 
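
The value log 3 can be verified directly. The worked computation below is a sketch that assumes the rows of the table are indexed by u and the columns by v; the total is the same under the other reading, since the three nonzero entries are equally likely and the chain rule gives H(U) + H(V|U) = H(U, V).

```latex
\begin{align*}
H(U) &= -\tfrac{2}{3}\log\tfrac{2}{3} - \tfrac{1}{3}\log\tfrac{1}{3}
      = \log 3 - \tfrac{2}{3} \approx 0.918 \text{ bits}, \\
H(V|U) &= \tfrac{2}{3}\, H\!\left(\tfrac{1}{2}\right) + \tfrac{1}{3}\cdot 0
        = \tfrac{2}{3} \approx 0.667 \text{ bits}, \\
H(U) + H(V|U) &= H(U, V) = \log 3 \approx 1.585 \text{ bits}.
\end{align*}
```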

15.4.1 Achievability of the Slepian-Wolf Theorem 

We now prove the achievability of the rates in the Slepian-Wolf theorem. 
Before we proceed to the proof, we introduce a new coding procedure 
using random bins. The essential idea of random bins is very similar to 
hash functions: We choose a large random index for each source sequence. 
If the set of typical source sequences is small enough (or equivalently, the 
range of the hash function is large enough), then with high probability, 
different source sequences have different indices, and we can recover the 
source sequence from the index. 

Let us consider the application of this idea to the encoding of a single 
source. In Chapter 3 the method that we considered was to index all 
elements of the typical set and not bother about elements outside the 
typical set. We will now describe the random binning procedure, which 
indexes all sequences but rejects untypical sequences at a later stage. 

Consider the following procedure: For each sequence $X^n$, draw an index 
at random from $\{1, 2, \ldots, 2^{nR}\}$. The set of sequences $X^n$ which have the 
same index are said to form a bin, since this can be viewed as first laying 
down a row of bins and then throwing the $X^n$'s at random into the bins. 
For decoding the source from the bin index, we look for a typical $X^n$ 
sequence in the bin. If there is one and only one typical $X^n$ sequence 
in the bin, we declare it to be the estimate $\hat{X}^n$ of the source sequence; 
otherwise, an error is declared. 
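
The binning procedure just described can be sketched for a toy binary source. This is only an illustration under stated assumptions: the source is taken to be i.i.d. Bernoulli(p), the typical set is the weakly typical set with tolerance `eps`, bins are numbered from 0, and all function names are hypothetical. The block length must be tiny because every sequence is enumerated.

```python
import itertools
import math
import random

def typical_set(n, p, eps):
    """All binary n-sequences x with |-(1/n) log2 p(x) - H(p)| <= eps."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    result = []
    for x in itertools.product((0, 1), repeat=n):
        k = sum(x)  # number of ones determines p(x)
        log_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log_prob / n - h) <= eps:
            result.append(x)
    return result

def make_bins(n, rate, rng):
    """Encoder f: give every n-sequence an independent uniform bin index."""
    num_bins = 2 ** math.ceil(n * rate)
    return {x: rng.randrange(num_bins)
            for x in itertools.product((0, 1), repeat=n)}

def decode(index, bins, typical):
    """Decoder g: the unique typical sequence in the bin, or None (error)."""
    candidates = [x for x in typical if bins[x] == index]
    return candidates[0] if len(candidates) == 1 else None
```

Note that `make_bins` never consults the typical set; only `decode` does, which is the property that lets the same idea work with separated encoders.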

The above procedure defines a source code. To analyze the probability 
of error for this code, we will now divide the $X^n$ sequences into two 
types, typical sequences and nontypical sequences. If the source sequence 
is typical, the bin corresponding to this source sequence will contain at 
least one typical sequence (the source sequence itself). Hence there will 
be an error only if there is more than one typical sequence in this bin. If 
the source sequence is nontypical, there will always be an error. But if 
the number of bins is much larger than the number of typical sequences, 
the probability that there is more than one typical sequence in a bin is 
very small, and hence the probability that a typical sequence will result 
in an error is very small. 

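Formally, let $f(X^n)$ be the bin index corresponding to $X^n$. Call the 
decoding function $g$. The probability of error (averaged over the random 
choice of codes $f$) is 

\begin{align}
P(g(f(\mathbf{X})) \neq \mathbf{X}) &\leq P(\mathbf{X} \notin A_\epsilon^{(n)}) + \sum_{\mathbf{x}} P(\exists\, \mathbf{x}' \neq \mathbf{x} : \mathbf{x}' \in A_\epsilon^{(n)}, f(\mathbf{x}') = f(\mathbf{x}))\, p(\mathbf{x}) \notag \\
&\leq \epsilon + \sum_{\mathbf{x}} \sum_{\substack{\mathbf{x}' \in A_\epsilon^{(n)} \\ \mathbf{x}' \neq \mathbf{x}}} P(f(\mathbf{x}') = f(\mathbf{x}))\, p(\mathbf{x}) \tag{15.161} \\
&\leq \epsilon + \sum_{\mathbf{x}} \sum_{\mathbf{x}' \in A_\epsilon^{(n)}} 2^{-nR} p(\mathbf{x}) \tag{15.162} \\
&= \epsilon + \sum_{\mathbf{x}' \in A_\epsilon^{(n)}} 2^{-nR} \sum_{\mathbf{x}} p(\mathbf{x}) \tag{15.163} \\
&\leq \epsilon + \sum_{\mathbf{x}' \in A_\epsilon^{(n)}} 2^{-nR} \tag{15.164} \\
&\leq \epsilon + 2^{n(H(X)+\epsilon)} 2^{-nR} \tag{15.165} \\
&\leq 2\epsilon \tag{15.166}
\end{align}

if $R > H(X) + \epsilon$ and n is sufficiently large. Hence, if the rate of the code 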
is greater than the entropy, the probability of error is arbitrarily small and 
the code achieves the same results as the code described in Chapter 3. 

The above example illustrates the fact that there are many ways to 
construct codes with low probabilities of error at rates above the entropy 
of the source; the universal source code is another example of such a code. 
Note that the binning scheme does not require an explicit characterization 
of the typical set at the encoder; it is needed only at the decoder. It is 
this property that enables this code to continue to work in the case of a 
distributed source, as illustrated in the proof of the theorem. 

We now return to the consideration of the distributed source coding and 
prove the achievability of the rate region in the Slepian-Wolf theorem. 

Proof: (Achievability in Theorem 15.4.1). The basic idea of the proof is 
to partition the space of $\mathcal{X}^n$ into $2^{nR_1}$ bins and the space of $\mathcal{Y}^n$ into $2^{nR_2}$ 
bins. 

Random code generation: Assign every $\mathbf{x} \in \mathcal{X}^n$ to one of $2^{nR_1}$ bins 
independently according to a uniform distribution on $\{1, 2, \ldots, 2^{nR_1}\}$. 
Similarly, randomly assign every $\mathbf{y} \in \mathcal{Y}^n$ to one of $2^{nR_2}$ bins. Reveal 
the assignments $f_1$ and $f_2$ to both the encoder and the decoder. 

Encoding: Sender 1 sends the index of the bin to which X belongs. 
Sender 2 sends the index of the bin to which Y belongs. 

Decoding: Given the received index pair $(i_0, j_0)$, declare $(\hat{\mathbf{x}}, \hat{\mathbf{y}}) = (\mathbf{x}, \mathbf{y})$ 
if there is one and only one pair of sequences $(\mathbf{x}, \mathbf{y})$ such that $f_1(\mathbf{x}) = i_0$, 
$f_2(\mathbf{y}) = j_0$ and $(\mathbf{x}, \mathbf{y}) \in A_\epsilon^{(n)}$. Otherwise, declare an error. The scheme 
is illustrated in Figure 15.20. The set of X sequences and the set of Y 
sequences are divided into bins in such a way that the pair of indices 
specifies a product bin. 



FIGURE 15.20. Slepian-Wolf encoding: the jointly typical pairs are isolated by the product 
bins. (Figure label: $2^{nH(X,Y)}$ jointly typical pairs $(\mathbf{x}^n, \mathbf{y}^n)$.) 
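
Probability of error: Let $(X_i, Y_i) \sim p(x, y)$. Define the events 

$$E_0 = \{(\mathbf{X}, \mathbf{Y}) \notin A_\epsilon^{(n)}\}, \tag{15.167}$$

$$E_1 = \{\exists\, \mathbf{x}' \neq \mathbf{X} : f_1(\mathbf{x}') = f_1(\mathbf{X}) \text{ and } (\mathbf{x}', \mathbf{Y}) \in A_\epsilon^{(n)}\}, \tag{15.168}$$

$$E_2 = \{\exists\, \mathbf{y}' \neq \mathbf{Y} : f_2(\mathbf{y}') = f_2(\mathbf{Y}) \text{ and } (\mathbf{X}, \mathbf{y}') \in A_\epsilon^{(n)}\}, \tag{15.169}$$

and 

$$E_{12} = \{\exists\, (\mathbf{x}', \mathbf{y}') : \mathbf{x}' \neq \mathbf{X},\ \mathbf{y}' \neq \mathbf{Y},\ f_1(\mathbf{x}') = f_1(\mathbf{X}),\ f_2(\mathbf{y}') = f_2(\mathbf{Y}) \text{ and } (\mathbf{x}', \mathbf{y}') \in A_\epsilon^{(n)}\}. \tag{15.170}$$

Here $\mathbf{X}$, $\mathbf{Y}$, $f_1$, and $f_2$ are random. We have an error if $(\mathbf{X}, \mathbf{Y})$ is not in 
$A_\epsilon^{(n)}$ or if there is another typical pair in the same bin. Hence by the union 
of events bound, 

\begin{align}
P_e^{(n)} &= P(E_0 \cup E_1 \cup E_2 \cup E_{12}) \tag{15.171} \\
&\leq P(E_0) + P(E_1) + P(E_2) + P(E_{12}). \tag{15.172}
\end{align}

First consider $E_0$. By the AEP, $P(E_0) \to 0$ and hence for n sufficiently 
large, $P(E_0) < \epsilon$. To bound $P(E_1)$, we have 

\begin{align}
P(E_1) &= P\{\exists\, \mathbf{x}' \neq \mathbf{X} : f_1(\mathbf{x}') = f_1(\mathbf{X}), \text{ and } (\mathbf{x}', \mathbf{Y}) \in A_\epsilon^{(n)}\} \tag{15.173} \\
&= \sum_{(\mathbf{x}, \mathbf{y})} p(\mathbf{x}, \mathbf{y})\, P\{\exists\, \mathbf{x}' \neq \mathbf{x} : f_1(\mathbf{x}') = f_1(\mathbf{x}),\ (\mathbf{x}', \mathbf{y}) \in A_\epsilon^{(n)}\} \tag{15.174} \\
&\leq \sum_{(\mathbf{x}, \mathbf{y})} p(\mathbf{x}, \mathbf{y}) \sum_{\substack{\mathbf{x}' \neq \mathbf{x} \\ (\mathbf{x}', \mathbf{y}) \in A_\epsilon^{(n)}}} P(f_1(\mathbf{x}') = f_1(\mathbf{x})) \tag{15.175} \\
&= \sum_{(\mathbf{x}, \mathbf{y})} p(\mathbf{x}, \mathbf{y})\, 2^{-nR_1} |A_\epsilon(X|\mathbf{y})| \tag{15.176} \\
&\leq 2^{-nR_1} 2^{n(H(X|Y)+\epsilon)} \quad \text{(by Theorem 15.2.2)}, \tag{15.177}
\end{align}

which goes to 0 if $R_1 > H(X|Y)$. Hence for sufficiently large n, $P(E_1) < \epsilon$. 
Similarly, for sufficiently large n, $P(E_2) < \epsilon$ if $R_2 > H(Y|X)$ and 
$P(E_{12}) < \epsilon$ if $R_1 + R_2 > H(X, Y)$. Since the average probability of error 
is $< 4\epsilon$, there exists at least one code $(f_1^*, f_2^*, g^*)$ with probability of error 
$< 4\epsilon$. Thus, we can construct a sequence of codes with $P_e^{(n)} \to 0$, and 
the proof of achievability is complete. □ 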

15.4.2 Converse for the Slepian-Wolf Theorem 

The converse for the Slepian-Wolf theorem follows obviously from the 
results for a single source, but we will provide it for completeness. 

Proof: (Converse to Theorem 15.4.1). As usual, we begin with Fano's 
inequality. Let $f_1$, $f_2$, $g$ be fixed. Let $I_0 = f_1(X^n)$ and $J_0 = f_2(Y^n)$. Then 

$$H(X^n, Y^n | I_0, J_0) \leq P_e^{(n)}\, n(\log |\mathcal{X}| + \log |\mathcal{Y}|) + 1 = n\epsilon_n, \tag{15.178}$$

where $\epsilon_n \to 0$ as $n \to \infty$. Now adding conditioning, we also have 

$$H(X^n | Y^n, I_0, J_0) \leq n\epsilon_n, \tag{15.179}$$

and 

$$H(Y^n | X^n, I_0, J_0) \leq n\epsilon_n. \tag{15.180}$$

We can write a chain of inequalities 

\begin{align}
n(R_1 + R_2) &\overset{(a)}{\geq} H(I_0, J_0) \tag{15.181} \\
&= I(X^n, Y^n; I_0, J_0) + H(I_0, J_0 | X^n, Y^n) \tag{15.182} \\
&\overset{(b)}{=} I(X^n, Y^n; I_0, J_0) \tag{15.183} \\
&= H(X^n, Y^n) - H(X^n, Y^n | I_0, J_0) \tag{15.184} \\
&\overset{(c)}{\geq} H(X^n, Y^n) - n\epsilon_n \tag{15.185} \\
&\overset{(d)}{=} nH(X, Y) - n\epsilon_n, \tag{15.186}
\end{align}

where 

(a) follows from the fact that $I_0 \in \{1, 2, \ldots, 2^{nR_1}\}$ and $J_0 \in \{1, 2, \ldots, 2^{nR_2}\}$ 

(b) follows from the fact that $I_0$ is a function of $X^n$ and $J_0$ is a function 
of $Y^n$ 

(c) follows from Fano's inequality (15.178) 

(d) follows from the chain rule and the fact that $(X_i, Y_i)$ are i.i.d. 

Similarly, using (15.179), we have 

\begin{align}
nR_1 &\overset{(a)}{\geq} H(I_0) \tag{15.187} \\
&\geq H(I_0 | Y^n) \tag{15.188} \\
&= I(X^n; I_0 | Y^n) + H(I_0 | X^n, Y^n) \tag{15.189} \\
&\overset{(b)}{=} I(X^n; I_0 | Y^n) \tag{15.190} \\
&= H(X^n | Y^n) - H(X^n | I_0, J_0, Y^n) \tag{15.191} \\
&\overset{(c)}{\geq} H(X^n | Y^n) - n\epsilon_n \tag{15.192} \\
&\overset{(d)}{=} nH(X|Y) - n\epsilon_n, \tag{15.193}
\end{align}

where the reasons are the same as for the equations above. Similarly, we 
can show that 

$$nR_2 \geq nH(Y|X) - n\epsilon_n. \tag{15.194}$$

Dividing these inequalities by n and taking the limit as $n \to \infty$, we have 
the desired converse. □ 

The region described in the Slepian-Wolf theorem is illustrated in 
Figure 15.21. 

15.4.3 Slepian-Wolf Theorem for Many Sources 


The results of Section 15.4.2 can easily be generalized to many sources. 
The proof follows exactly the same lines. 

Theorem 15.4.2 Let $(X_{1i}, X_{2i}, \ldots, X_{mi})$ be i.i.d. $\sim p(x_1, x_2, \ldots, x_m)$. 
Then the set of rate vectors achievable for distributed source coding with 
separate encoders and a common decoder is defined by 

$$R(S) > H(X(S) | X(S^c)) \tag{15.195}$$

for all $S \subseteq \{1, 2, \ldots, m\}$, where 

$$R(S) = \sum_{i \in S} R_i \tag{15.196}$$

and $X(S) = \{X_j : j \in S\}$. 

FIGURE 15.21. Rate region for Slepian-Wolf encoding. (Axes: $R_1$ horizontal, marked at $H(X|Y)$ and $H(X)$; $R_2$ vertical, marked at $H(Y|X)$ and $H(Y)$.) 



Proof: The proof is identical to the case of two variables and is 
omitted. □ 
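
The subset conditions of Theorem 15.4.2 are easy to check mechanically for a small joint distribution. The sketch below is illustrative only: it assumes the joint pmf is supplied as a dictionary from outcome tuples to probabilities, measures entropy in bits, and treats boundary points as inside the region (a non-strict inequality with a small tolerance, in line with taking the closure); the function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict

def entropy_of_marginal(joint, coords):
    """Entropy in bits of the marginal of `joint` on the coordinates in `coords`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in coords)] += p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def in_slepian_wolf_region(rates, joint):
    """Check R(S) >= H(X(S) | X(S^c)) for every nonempty subset S of the sources."""
    m = len(rates)
    h_all = entropy_of_marginal(joint, list(range(m)))
    for size in range(1, m + 1):
        for subset in itertools.combinations(range(m), size):
            complement = [i for i in range(m) if i not in subset]
            # H(X(S) | X(S^c)) = H(X_1, ..., X_m) - H(X(S^c))
            h_cond = h_all - entropy_of_marginal(joint, complement)
            if sum(rates[i] for i in subset) < h_cond - 1e-12:
                return False
    return True
```

For m sources this loops over all 2^m - 1 nonempty subsets, so it is meant only for small m.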

The achievability of Slepian-Wolf encoding has been proved for an 
i.i.d. correlated source, but the proof can easily be extended to the case 
of an arbitrary joint source that satisfies the AEP; in particular, it can 
be extended to the case of any jointly ergodic source [122]. In these 
cases the entropies in the definition of the rate region are replaced by the 
corresponding entropy rates. 


15.4.4 Interpretation of Slepian-Wolf Coding 

We consider an interpretation of the corner points of the rate region in 
Slepian-Wolf encoding in terms of graph coloring. Consider the point 
with rate $R_1 = H(X)$, $R_2 = H(Y|X)$. Using $nH(X)$ bits, we can encode 
$X^n$ efficiently, so that the decoder can reconstruct $X^n$ with arbitrarily low 
probability of error. But how do we code $Y^n$ with $nH(Y|X)$ bits? Looking 
at the picture in terms of typical sets, we see that associated with every 
$X^n$ is a typical "fan" of $Y^n$ sequences that are jointly typical with the 
given $X^n$ as shown in Figure 15.22. 

If the Y encoder knows $X^n$, the encoder can send the index of the $Y^n$ 
within this typical fan. The decoder, also knowing $X^n$, can then construct 
this typical fan and hence reconstruct $Y^n$. But the Y encoder does not 
know $X^n$. So instead of trying to determine the typical fan, he randomly 
colors all $Y^n$ sequences with $2^{nR_2}$ colors. If the number of colors is high 
enough, then with high probability all the colors in a particular fan will 
be different and the color of the $Y^n$ sequence will uniquely define the 
$Y^n$ sequence within the $X^n$ fan. If the rate $R_2 > H(Y|X)$, the number of 
colors is exponentially larger than the number of elements in the fan and 
we can show that the scheme will have an exponentially small probability 
of error. 
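
The coloring argument can be illustrated with a toy model. The sketch below makes a simplifying assumption that is not in the text: the typical fan of a binary sequence x is replaced by a Hamming ball around x (all y that differ from x in at most `radius` positions). The Y encoder colors every sequence at random without looking at x; the decoder, which knows x, searches only the fan. All names are hypothetical.

```python
import itertools
import random

def hamming_ball(center, radius):
    """Stand-in for the typical fan: all binary sequences within `radius` flips of `center`."""
    n = len(center)
    ball = []
    for r in range(radius + 1):
        for flips in itertools.combinations(range(n), r):
            y = list(center)
            for i in flips:
                y[i] ^= 1
            ball.append(tuple(y))
    return ball

def random_coloring(n, num_colors, rng):
    """Y encoder: color every binary n-sequence independently and uniformly."""
    return {y: rng.randrange(num_colors)
            for y in itertools.product((0, 1), repeat=n)}

def decode_y(x, color, coloring, radius):
    """Decoder with side information x: the unique y in the fan of x with this color, else None."""
    matches = [y for y in hamming_ball(x, radius) if coloring[y] == color]
    return matches[0] if len(matches) == 1 else None
```

When the number of colors is much larger than the size of the fan, two sequences in the same fan rarely share a color, so `decode_y` usually returns the sequence that was colored.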


15.5 DUALITY BETWEEN SLEPIAN-WOLF ENCODING 
AND MULTIPLE-ACCESS CHANNELS 

With multiple-access channels, we considered the problem of sending 
independent messages over a channel with two inputs and only one output. 
With Slepian-Wolf encoding, we considered the problem of sending a 
correlated source over a noiseless channel, with a common decoder for 
recovery of both sources. In this section we explore the duality between 
the two systems. 

In Figure 15.23, two independent messages are to be sent over the 
channel as $X_1^n$ and $X_2^n$ sequences. The receiver estimates the messages 
from the received sequence. In Figure 15.24 the correlated sources are 
encoded as "independent" messages i and j. The receiver tries to estimate 
the source sequences from knowledge of i and j. 

FIGURE 15.24. Correlated source encoding. 

In the proof of the achievability of the capacity region for the multiple-access 
channel, we used a random map from the set of messages to the 
sequences $X_1^n$ and $X_2^n$. In the proof for Slepian-Wolf coding, we used a 
random map from the set of sequences $X^n$ and $Y^n$ to a set of messages. 
In the proof of the coding theorem for the multiple-access channel, the 
probability of error was bounded by 

$$P_e^{(n)} \leq \epsilon + \sum_{\text{codewords}} \Pr(\text{codeword jointly typical with sequence received}) \tag{15.197}$$

$$= \epsilon + \sum_{2^{nR_1} \text{ terms}} 2^{-nI_1} + \sum_{2^{nR_2} \text{ terms}} 2^{-nI_2} + \sum_{2^{n(R_1+R_2)} \text{ terms}} 2^{-nI_3}, \tag{15.198}$$













where $\epsilon$ is the probability the sequences are not typical, $R_i$ are the rates 
corresponding to the number of codewords that can contribute to the 
probability of error, and $I_i$ is the corresponding mutual information that 
corresponds to the probability that the codeword is jointly typical with 
the received sequence. 

In the case of Slepian-Wolf encoding, the corresponding expression 
for the probability of error is 

$$P_e^{(n)} \leq \epsilon + \sum_{\text{jointly typical sequences}} \Pr(\text{have the same codeword}) \tag{15.199}$$

$$\leq \epsilon + \sum_{2^{nH_1} \text{ terms}} 2^{-nR_1} + \sum_{2^{nH_2} \text{ terms}} 2^{-nR_2} + \sum_{2^{nH_3} \text{ terms}} 2^{-n(R_1+R_2)}, \tag{15.200}$$

where again the probability that the constraints of the AEP are not satisfied 
is bounded by $\epsilon$, and the other terms refer to the various ways in which 
another pair of sequences could be jointly typical and in the same bin as 
the given source pair. 

The duality of the multiple-access channel and correlated source encoding 
is now obvious. It is rather surprising that these two systems are duals 
of each other; one would have expected a duality between the broadcast 
channel and the multiple-access channel. 


15.6 BROADCAST CHANNEL 

The broadcast channel is a communication channel in which there is one 
sender and two or more receivers. It is illustrated in Figure 15.25. The 
basic problem is to find the set of simultaneously achievable rates for 
communication in a broadcast channel. Before we begin the analysis, let 
us consider some examples. 


Example 15.6.1 (TV station) The simplest example of the broadcast 
channel is a radio or TV station. But this example is slightly degenerate 
in the sense that normally the station wants to send the same information 
to everybody who is tuned in; the capacity is essentially 
$\max_{p(x)} \min_i I(X; Y_i)$, which may be less than the capacity of the worst receiver. 
But we may wish to arrange the information in such a way that the better 
receivers receive extra information, which produces a better picture 
or sound, while the worst receivers continue to receive more basic 
information. As TV stations introduce high-definition TV (HDTV), it may be 
necessary to encode the information so that bad receivers will receive the 
regular TV signal, while good receivers will receive the extra information 
for the high-definition signal. The methods to accomplish this will 
be explained in the discussion of the broadcast channel. 

FIGURE 15.25. Broadcast channel. 

Example 15.6.2 (Lecturer in classroom) A lecturer in a classroom is 
communicating information to the students in the class. Due to differences 
among the students, they receive various amounts of information. Some of 
the students receive most of the information; others receive only a little. In 
the ideal situation, the lecturer would be able to tailor his or her lecture in 
such a way that the good students receive more information and the poor 
students receive at least the minimum amount of information. However, a 
poorly prepared lecture proceeds at the pace of the weakest student. This 
situation is another example of a broadcast channel. 

Example 15.6.3 (Orthogonal broadcast channels) The simplest broadcast 
channel consists of two independent channels to the two receivers. 
Here we can send independent information over both channels, and we 
can achieve rate $R_1$ to receiver 1 and rate $R_2$ to receiver 2 if $R_1 < C_1$ and 
$R_2 < C_2$. The capacity region is the rectangle shown in Figure 15.26. 

FIGURE 15.26. Capacity region for two orthogonal broadcast channels. 

Example 15.6.4 (Spanish and Dutch speaker) To illustrate the idea of 
superposition, we will consider a simplified example of a speaker who can 
speak both Spanish and Dutch. There are two listeners: One understands 
only Spanish and the other understands only Dutch. Assume for simplicity 
that the vocabulary of each language is $2^{20}$ words and that the speaker 
speaks at the rate of 1 word per second in either language. Then he 
can transmit 20 bits of information per second to receiver 1 by speaking 
to her all the time; in this case, he sends no information to receiver 2. 
Similarly, he can send 20 bits per second to receiver 2 without sending 
any information to receiver 1. Thus, he can achieve any rate pair with 
$R_1 + R_2 = 20$ by simple time-sharing. But can he do better? 

Recall that the Dutch listener, even though he does not understand 
Spanish, can recognize when the word is Spanish. Similarly, the Spanish 
listener can recognize when Dutch occurs. The speaker can use this to 
convey information; for example, if the proportion of time he uses each 
language is 50%, then of a sequence of 100 words, about 50 will be 
Dutch and about 50 will be Spanish. But there are many ways to order the 
Spanish and Dutch words; in fact, there are about $\binom{100}{50} \approx 2^{100H(\frac{1}{2})}$ ways 
to order the words. Choosing one of these orderings conveys information 
to both listeners. This method enables the speaker to send information at 
a rate of 10 bits per second to the Dutch receiver, 10 bits per second to 
the Spanish receiver, and 1 bit per second of common information to both 
receivers, for a total rate of 21 bits per second, which is more than that 
achievable by simple timesharing. This is an example of superposition of 
information. 
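
The counting step in this example can be checked with exact integer arithmetic. The following sketch assumes a block of 100 words split evenly between the two languages, as in the text; the variable names are illustrative. For this finite block length the ordering carries slightly less than the asymptotic 1 bit per word.

```python
import math

words = 100                                   # block of 100 words, one word per second
orderings = math.comb(words, words // 2)      # ways to place 50 Spanish words among 100
ordering_bits = math.log2(orderings)          # information carried by the ordering alone
common_rate = ordering_bits / words           # bits per second seen by both listeners

private_rate = 0.5 * 20                       # each listener understands half the words, 20 bits each
total_rate = 2 * private_rate + common_rate   # compare with 20 bits/s for plain time-sharing
```

As the block length grows, `common_rate` approaches H(1/2) = 1 bit per second, which gives the total of 21 bits per second stated above.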

The results of the broadcast channel can also be applied to the case 
of a single-user channel with an unknown distribution. In this case, the 
objective is to get at least the minimum information through when the 
channel is bad and to get some extra information through when the channel 
is good. We can use the same superposition arguments as in the case of 
the broadcast channel to find the rates at which we can send information. 

15.6.1 Definitions for a Broadcast Channel 

Definition A broadcast channel consists of an input alphabet $\mathcal{X}$ and 
two output alphabets, $\mathcal{Y}_1$ and $\mathcal{Y}_2$, and a probability transition function 
$p(y_1, y_2 | x)$. The broadcast channel will be said to be memoryless if 

$$p(y_1^n, y_2^n | x^n) = \prod_{i=1}^{n} p(y_{1i}, y_{2i} | x_i).$$

We define codes, probability of error, achievability, and capacity regions 
for the broadcast channel as we did for the multiple-access channel. A 
$((2^{nR_1}, 2^{nR_2}), n)$ code for a broadcast channel with independent 
information consists of an encoder, 

$$X : (\{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\}) \to \mathcal{X}^n, \tag{15.201}$$

and two decoders, 

$$g_1 : \mathcal{Y}_1^n \to \{1, 2, \ldots, 2^{nR_1}\} \tag{15.202}$$

and 

$$g_2 : \mathcal{Y}_2^n \to \{1, 2, \ldots, 2^{nR_2}\}. \tag{15.203}$$

We define the average probability of error as the probability that the 
decoded message is not equal to the transmitted message; that is, 

$$P_e^{(n)} = P(g_1(Y_1^n) \neq W_1 \text{ or } g_2(Y_2^n) \neq W_2), \tag{15.204}$$

where $(W_1, W_2)$ are assumed to be uniformly distributed over $2^{nR_1} \times 2^{nR_2}$. 

Definition A rate pair $(R_1, R_2)$ is said to be achievable for the broadcast 
channel if there exists a sequence of $((2^{nR_1}, 2^{nR_2}), n)$ codes with 
$P_e^{(n)} \to 0$. 

We will now define the rates for the case where we have common 
information to be sent to both receivers. A $((2^{nR_0}, 2^{nR_1}, 2^{nR_2}), n)$ code 
for a broadcast channel with common information consists of an encoder, 

$$X : (\{1, 2, \ldots, 2^{nR_0}\} \times \{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\}) \to \mathcal{X}^n, \tag{15.205}$$

and two decoders, 

$$g_1 : \mathcal{Y}_1^n \to \{1, 2, \ldots, 2^{nR_0}\} \times \{1, 2, \ldots, 2^{nR_1}\} \tag{15.206}$$

and 

$$g_2 : \mathcal{Y}_2^n \to \{1, 2, \ldots, 2^{nR_0}\} \times \{1, 2, \ldots, 2^{nR_2}\}. \tag{15.207}$$

Assuming that the distribution on $(W_0, W_1, W_2)$ is uniform, we can define 
the probability of error as the probability that the decoded message is not 
equal to the transmitted message: 

$$P_e^{(n)} = P(g_1(Y_1^n) \neq (W_0, W_1) \text{ or } g_2(Y_2^n) \neq (W_0, W_2)). \tag{15.208}$$

Definition A rate triple $(R_0, R_1, R_2)$ is said to be achievable for the 
broadcast channel with common information if there exists a sequence of 
$((2^{nR_0}, 2^{nR_1}, 2^{nR_2}), n)$ codes with $P_e^{(n)} \to 0$. 

Definition The capacity region of the broadcast channel is the closure 
of the set of achievable rates. 

We observe that an error for receiver $Y_1^n$ depends only on the distribution 
$p(x^n, y_1^n)$ and not on the joint distribution $p(x^n, y_1^n, y_2^n)$. Thus, we have 
the following theorem: 

Theorem 15.6.1 The capacity region of a broadcast channel depends 
only on the conditional marginal distributions $p(y_1|x)$ and $p(y_2|x)$. 

Proof: See the problems. □ 

15.6.2 Degraded Broadcast Channels 

Definition A broadcast channel is said to be physically degraded if 
$p(y_1, y_2 | x) = p(y_1 | x)\, p(y_2 | y_1)$. 

Definition A broadcast channel is said to be stochastically degraded if 
its conditional marginal distributions are the same as that of a physically 
degraded broadcast channel; that is, if there exists a distribution $p'(y_2 | y_1)$ 
such that 

$$p(y_2 | x) = \sum_{y_1} p(y_1 | x)\, p'(y_2 | y_1). \tag{15.209}$$


Note that since the capacity of a broadcast channel depends only on the 
conditional marginals, the capacity region of the stochastically degraded 
broadcast channel is the same as that of the corresponding physically 
degraded channel. In much of the following, we therefore assume that the 
channel is physically degraded. 

15.6.3 Capacity Region for the Degraded Broadcast Channel 

We now consider sending independent information over a degraded broadcast 
channel at rate $R_1$ to $Y_1$ and rate $R_2$ to $Y_2$. 

Theorem 15.6.2 The capacity region for sending independent information 
over the degraded broadcast channel $X \to Y_1 \to Y_2$ is the convex 
hull of the closure of all $(R_1, R_2)$ satisfying 

$$R_2 \leq I(U; Y_2), \tag{15.210}$$

$$R_1 \leq I(X; Y_1 | U) \tag{15.211}$$

for some joint distribution $p(u)p(x|u)p(y_1, y_2|x)$, where the auxiliary random 
variable U has cardinality bounded by $|\mathcal{U}| \leq \min\{|\mathcal{X}|, |\mathcal{Y}_1|, |\mathcal{Y}_2|\}$. 

Proof: (The cardinality bounds for the auxiliary random variable U are 
derived using standard methods from convex set theory and are not dealt 
with here.) We first give an outline of the basic idea of superposition 
coding for the broadcast channel. The auxiliary random variable U will 
serve as a cloud center that can be distinguished by both receivers $Y_1$ 
and $Y_2$. Each cloud consists of $2^{nR_1}$ codewords $X^n$ distinguishable by the 
receiver $Y_1$. The worst receiver can only see the clouds, while the better 
receiver can see the individual codewords within the clouds. The formal 
proof of the achievability of this region uses a random coding argument: 
Fix $p(u)$ and $p(x|u)$. 

Random codebook generation: Generate $2^{nR_2}$ independent codewords 
of length n, $\mathbf{U}(w_2)$, $w_2 \in \{1, 2, \ldots, 2^{nR_2}\}$, according to $\prod_{i=1}^{n} p(u_i)$. For 
each codeword $\mathbf{U}(w_2)$, generate $2^{nR_1}$ independent codewords $\mathbf{X}(w_1, w_2)$ 
according to $\prod_{i=1}^{n} p(x_i | u_i(w_2))$. Here $\mathbf{u}(i)$ plays the role of the cloud 
center understandable to both $Y_1$ and $Y_2$, while $\mathbf{x}(i, j)$ is the jth satellite 
codeword in the ith cloud. 

Encoding: To send the pair $(W_1, W_2)$, send the corresponding codeword 
$\mathbf{X}(W_1, W_2)$. 

Decoding: Receiver 2 determines the unique $\hat{W}_2$ such that $(\mathbf{U}(\hat{W}_2), 
\mathbf{Y}_2) \in A_\epsilon^{(n)}$. If there are none such or more than one such, an error is 
declared. 

Receiver 1 looks for the unique $(\hat{W}_1, \hat{W}_2)$ such that $(\mathbf{U}(\hat{W}_2), \mathbf{X}(\hat{W}_1, \hat{W}_2), 
\mathbf{Y}_1) \in A_\epsilon^{(n)}$. If there are none such or more than one such, an error is 
declared. 
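
The codebook generation and encoding steps can be sketched for a binary input alphabet. The choices below are assumptions made for illustration, not part of the proof: cloud centers are drawn i.i.d. Bernoulli(1/2), and p(x|u) is a binary symmetric channel with crossover probability `beta`. Messages are numbered from 0, and the function names are hypothetical. Typical-set decoding is not implemented here.

```python
import random

def generate_superposition_codebook(n, num_clouds, num_satellites, beta, rng):
    """Draw cloud centers u(w2) and satellite codewords x(w1, w2).

    Each center is i.i.d. Bernoulli(1/2); each satellite is its center with
    every bit flipped independently with probability beta, i.e. p(x|u) is a BSC(beta).
    """
    centers = [[rng.randrange(2) for _ in range(n)] for _ in range(num_clouds)]
    satellites = {}
    for w2, center in enumerate(centers):
        for w1 in range(num_satellites):
            satellites[(w1, w2)] = [
                bit ^ (1 if rng.random() < beta else 0) for bit in center
            ]
    return centers, satellites

def encode(w1, w2, satellites):
    """Encoder: transmit the satellite codeword indexed by the message pair."""
    return satellites[(w1, w2)]
```

In the notation of the proof, `num_clouds` plays the role of 2^{nR_2} and `num_satellites` the role of 2^{nR_1}; a small `beta` keeps satellites close to their cloud center.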

Analysis of the probability of error: By the symmetry of the code generation, 
the probability of error does not depend on which codeword was 
sent. Hence, without loss of generality, we can assume that the message 
pair $(W_1, W_2) = (1, 1)$ was sent. Let $P(\cdot)$ denote the conditional 
probability of an event given that (1,1) was sent. 

Since we have essentially a single-user channel from U to $Y_2$, we will 
be able to decode the U codewords with a low probability of error if 
$R_2 < I(U; Y_2)$. To prove this, we define the events 

$$E_{Y_i} = \{(\mathbf{U}(i), \mathbf{Y}_2) \in A_\epsilon^{(n)}\}. \tag{15.212}$$

Then the probability of error at receiver 2 is 

\begin{align}
P_e^{(n)}(2) &= P\Big(E_{Y_1}^c \cup \bigcup_{i \neq 1} E_{Y_i}\Big) \tag{15.213} \\
&\leq P(E_{Y_1}^c) + \sum_{i \neq 1} P(E_{Y_i}) \tag{15.214} \\
&\leq \epsilon + 2^{nR_2} 2^{-n(I(U;Y_2) - 2\epsilon)} \tag{15.215} \\
&\leq 2\epsilon \tag{15.216}
\end{align}

if n is large enough and $R_2 < I(U; Y_2)$, where (15.215) follows from the 
AEP. Similarly, for decoding for receiver 1, we define the events 

$$\tilde{E}_{Y_i} = \{(\mathbf{U}(i), \mathbf{Y}_1) \in A_\epsilon^{(n)}\}, \tag{15.217}$$

$$\tilde{E}_{Y_{ij}} = \{(\mathbf{U}(i), \mathbf{X}(i, j), \mathbf{Y}_1) \in A_\epsilon^{(n)}\}, \tag{15.218}$$

where the tilde refers to events defined at receiver 1. Then we can bound 
the probability of error as 

\begin{align}
P_e^{(n)}(1) &= P\Big(\tilde{E}_{Y_1}^c \cup \tilde{E}_{Y_{11}}^c \cup \bigcup_{i \neq 1} \tilde{E}_{Y_i} \cup \bigcup_{j \neq 1} \tilde{E}_{Y_{1j}}\Big) \tag{15.219} \\
&\leq P(\tilde{E}_{Y_1}^c) + P(\tilde{E}_{Y_{11}}^c) + \sum_{i \neq 1} P(\tilde{E}_{Y_i}) + \sum_{j \neq 1} P(\tilde{E}_{Y_{1j}}). \tag{15.220}
\end{align}

By the same arguments as for receiver 2, we can bound $P(\tilde{E}_{Y_i}) \leq 
2^{-n(I(U;Y_1) - 3\epsilon)}$. Hence, the third term goes to 0 if $R_2 < I(U; Y_1)$. But 
by the data-processing inequality and the degraded nature of the channel, 
$I(U; Y_1) \geq I(U; Y_2)$, and hence the conditions of the theorem imply 
that the third term goes to 0. We can also bound the fourth term in the 
probability of error as 

\begin{align}
P(\tilde{E}_{Y_{1j}}) &= P((\mathbf{U}(1), \mathbf{X}(1, j), \mathbf{Y}_1) \in A_\epsilon^{(n)}) \tag{15.221} \\
&= \sum_{(\mathbf{U}, \mathbf{X}, \mathbf{Y}_1) \in A_\epsilon^{(n)}} P((\mathbf{U}(1), \mathbf{X}(1, j), \mathbf{Y}_1)) \tag{15.222} \\
&= \sum_{(\mathbf{U}, \mathbf{X}, \mathbf{Y}_1) \in A_\epsilon^{(n)}} P(\mathbf{U}(1))\, P(\mathbf{X}(1, j) | \mathbf{U}(1))\, P(\mathbf{Y}_1 | \mathbf{U}(1)) \tag{15.223} \\
&\leq \sum_{(\mathbf{U}, \mathbf{X}, \mathbf{Y}_1) \in A_\epsilon^{(n)}} 2^{-n(H(U) - \epsilon)}\, 2^{-n(H(X|U) - \epsilon)}\, 2^{-n(H(Y_1|U) - \epsilon)} \tag{15.224} \\
&\leq 2^{n(H(U, X, Y_1) + \epsilon)}\, 2^{-n(H(U) - \epsilon)}\, 2^{-n(H(X|U) - \epsilon)}\, 2^{-n(H(Y_1|U) - \epsilon)} \tag{15.225} \\
&= 2^{-n(I(X; Y_1 | U) - 4\epsilon)}. \tag{15.226}
\end{align}

Hence, if $R_1 < I(X; Y_1 | U)$, the fourth term in the probability of error 
goes to 0. Thus, we can bound the probability of error 

\begin{align}
P_e^{(n)}(1) &\leq \epsilon + \epsilon + 2^{nR_2} 2^{-n(I(U;Y_1) - 3\epsilon)} + 2^{nR_1} 2^{-n(I(X;Y_1|U) - 4\epsilon)} \tag{15.227} \\
&\leq 4\epsilon \tag{15.228}
\end{align}

if n is large enough and $R_2 < I(U; Y_1)$ and $R_1 < I(X; Y_1 | U)$. The above 
bounds show that we can decode the messages with total probability 
of error that goes to 0. Hence, there exists a sequence of good $((2^{nR_1}, 
2^{nR_2}), n)$ codes $\mathcal{C}_n^*$ with probability of error going to 0. With this, we 
complete the proof of the achievability of the capacity region for the degraded 
broadcast channel. Gallager's proof [225] of the converse is outlined in 
Problem 15.11. □ 


So far we have considered sending independent information to each 
receiver. But in certain situations, we wish to send common information 
to both receivers. Let the rate at which we send common information be 
$R_0$. Then we have the following obvious theorem: 

Theorem 15.6.3 If the rate pair $(R_1, R_2)$ is achievable for a broadcast 
channel with independent information, the rate triple $(R_0, R_1 - R_0, R_2 - R_0)$ 
with a common rate $R_0$ is achievable, provided that $R_0 \leq \min(R_1, R_2)$. 

In the case of a degraded broadcast channel, we can do even better. 
Since by our coding scheme the better receiver always decodes all the 
information that is sent to the worst receiver, one need not reduce the 
amount of information sent to the better receiver when we have common 
information. Hence, we have the following theorem: 

Theorem 15.6.4 If the rate pair $(R_1, R_2)$ is achievable for a degraded 
broadcast channel, the rate triple $(R_0, R_1, R_2 - R_0)$ is achievable for the 
channel with common information, provided that $R_0 < R_2$. 

We end this section by considering the example of the binary symmetric 
broadcast channel. 

Example 15.6.5 Consider a pair of binary symmetric channels with
parameters $p_1$ and $p_2$ that form a broadcast channel as shown in
Figure 15.27. Without loss of generality in the capacity calculation, we can
recast this channel as a physically degraded channel. We assume that
$p_1 < p_2 < \frac{1}{2}$. Then we can express a binary symmetric channel with
parameter $p_2$ as a cascade of a binary symmetric channel with parameter
$p_1$ with another binary symmetric channel. Let the crossover probability
of the new channel be $\alpha$. Then we must have

$$p_1(1 - \alpha) + (1 - p_1)\alpha = p_2 \tag{15.229}$$


FIGURE 15.27. Binary symmetric broadcast channel.

FIGURE 15.28. Physically degraded binary symmetric broadcast channel.


or

$$\alpha = \frac{p_2 - p_1}{1 - 2p_1}. \tag{15.230}$$
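As a quick numerical check of (15.229) and (15.230), the following minimal sketch computes the crossover probability of the added channel and confirms that the cascade reproduces $p_2$. The function names and the sample values $p_1 = 0.1$, $p_2 = 0.2$ are illustrative assumptions, not part of the text.

```python
def cascade_crossover(p1: float, p2: float) -> float:
    """Crossover probability alpha of the second BSC, as in (15.230)."""
    if not (0.0 <= p1 < p2 < 0.5):
        raise ValueError("expected 0 <= p1 < p2 < 1/2")
    return (p2 - p1) / (1.0 - 2.0 * p1)


def bsc_cascade(a: float, b: float) -> float:
    """Overall crossover probability of two BSCs in series: a(1-b) + (1-a)b."""
    return a * (1.0 - b) + (1.0 - a) * b


# Illustrative values (assumed, not from the text).
p1, p2 = 0.1, 0.2
alpha = cascade_crossover(p1, p2)
print("alpha =", alpha)
print("cascade crossover =", bsc_cascade(p1, alpha))  # should match p2 up to rounding
```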

We now consider the auxiliary random variable in the definition of the
capacity region. In this case, $U$ can be taken to be binary, by the
cardinality bound of the theorem. By symmetry, we connect $U$ to $X$ by
another binary symmetric channel with parameter $\beta$, as illustrated in
Figure 15.28.

We can now calculate the rates in the capacity region. It is clear by
symmetry that the distribution on $U$ that maximizes the rates is the
uniform distribution on $\{0, 1\}$, so that

$$I(U; Y_2) = H(Y_2) - H(Y_2 | U) \tag{15.231}$$

$$= 1 - H(\beta * p_2), \tag{15.232}$$

where

$$\beta * p_2 = \beta(1 - p_2) + (1 - \beta)p_2. \tag{15.233}$$

Similarly,

$$I(X; Y_1 | U) = H(Y_1 | U) - H(Y_1 | X, U) \tag{15.234}$$

$$= H(Y_1 | U) - H(Y_1 | X) \tag{15.235}$$

$$= H(\beta * p_1) - H(p_1), \tag{15.236}$$

where

$$\beta * p_1 = \beta(1 - p_1) + (1 - \beta)p_1. \tag{15.237}$$
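The rate expressions (15.232) and (15.236) are easy to evaluate numerically. The sketch below tabulates the boundary pair $(R_1, R_2)$ as $\beta$ runs from 0 to 1/2; the helper names `h2`, `conv`, and `bsc_broadcast_rates` and the sample crossover probabilities are assumptions made for illustration.

```python
from math import log2


def h2(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def conv(a: float, b: float) -> float:
    """Binary convolution a * b = a(1-b) + (1-a)b, as in (15.233) and (15.237)."""
    return a * (1.0 - b) + (1.0 - a) * b


def bsc_broadcast_rates(beta: float, p1: float, p2: float):
    """Boundary point (R1, R2) from (15.236) and (15.232) for a given beta."""
    r1 = h2(conv(beta, p1)) - h2(p1)   # I(X; Y1 | U)
    r2 = 1.0 - h2(conv(beta, p2))      # I(U; Y2)
    return r1, r2


# Illustrative parameters (assumed, not from the text).
p1, p2 = 0.1, 0.2
for k in range(6):
    beta = 0.1 * k
    r1, r2 = bsc_broadcast_rates(beta, p1, p2)
    print(f"beta={beta:.1f}  R1={r1:.4f}  R2={r2:.4f}")
```

Increasing $\beta$ trades rate for the degraded receiver against rate for the better receiver, which is the behavior the text describes at the two endpoints.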


Plotting these points as a function of $\beta$, we obtain the capacity region
in Figure 15.29. When $\beta = 0$, we have maximum information transfer
to $Y_2$ [i.e., $R_2 = 1 - H(p_2)$ and $R_1 = 0$]. When $\beta = \frac{1}{2}$, we have
maximum information transfer to $Y_1$ [i.e., $R_1 = 1 - H(p_1)$] and no
information transfer to $Y_2$. These values of $\beta$ give us the corner points
of the rate region.

FIGURE 15.30. Gaussian broadcast channel.


Example 15.6.6 (Gaussian broadcast channel) The Gaussian broadcast
channel is illustrated in Figure 15.30. We have shown it in the case
where one output is a degraded version of the other output. Based on
the results of Problem 15.10, it follows that all scalar Gaussian broadcast
channels are equivalent to this type of degraded channel.

$$Y_1 = X + Z_1, \tag{15.238}$$

$$Y_2 = X + Z_2 = Y_1 + Z_2', \tag{15.239}$$

where $Z_1 \sim \mathcal{N}(0, N_1)$ and $Z_2' \sim \mathcal{N}(0, N_2 - N_1)$.

Extending the results of this section to the Gaussian case, we can show
that the capacity region of this channel is given by

$$R_1 < C\left(\frac{\alpha P}{N_1}\right) \tag{15.240}$$

$$R_2 < C\left(\frac{(1 - \alpha)P}{\alpha P + N_2}\right), \tag{15.241}$$

where $\alpha$ may be arbitrarily chosen ($0 \le \alpha \le 1$). The coding scheme that
achieves this capacity region is outlined in Section 15.1.3.
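To see how the power split $\alpha$ moves the rate pair along the boundary, here is a small sketch that evaluates (15.240) and (15.241). It assumes $C(x) = \frac{1}{2}\log_2(1 + x)$ bits per transmission; the function names and the values of $P$, $N_1$, $N_2$ are illustrative assumptions.

```python
from math import log2


def C(x: float) -> float:
    """Gaussian capacity function C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * log2(1.0 + x)


def gaussian_broadcast_rates(alpha: float, P: float, N1: float, N2: float):
    """Boundary point (R1, R2) of (15.240)-(15.241) for a power split alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    r1 = C(alpha * P / N1)                        # better receiver, noise N1
    r2 = C((1.0 - alpha) * P / (alpha * P + N2))  # worse receiver treats alpha*P as noise
    return r1, r2


# Illustrative parameters (assumed, not from the text).
P, N1, N2 = 10.0, 1.0, 4.0
for k in range(5):
    alpha = 0.25 * k
    r1, r2 = gaussian_broadcast_rates(alpha, P, N1, N2)
    print(f"alpha={alpha:.2f}  R1={r1:.4f}  R2={r2:.4f}")
```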

15.7 RELAY CHANNEL 

The relay channel is a channel in which there is one sender and one 
receiver with a number of intermediate nodes that act as relays to help 
the communication from the sender to the receiver. The simplest relay 
channel has only one intermediate or relay node. In this case the channel 
consists of four finite sets $\mathcal{X}$, $\mathcal{X}_1$, $\mathcal{Y}$, and $\mathcal{Y}_1$ and a collection of
probability mass functions $p(y, y_1 | x, x_1)$ on $\mathcal{Y} \times \mathcal{Y}_1$, one for each
$(x, x_1) \in \mathcal{X} \times \mathcal{X}_1$.
The interpretation is that $x$ is the input to the channel and $y$ is the output
of the channel, $y_1$ is the relay's observation, and $x_1$ is the input symbol
chosen by the relay, as shown in Figure 15.31. The problem is to find the
capacity of the channel between the sender $X$ and the receiver $Y$.

The relay channel combines a broadcast channel ($X$ to $Y$ and $Y_1$) and a
multiple-access channel ($X$ and $X_1$ to $Y$). The capacity is known for the
special case of the physically degraded relay channel. We first prove an
outer bound on the capacity of a general relay channel and later establish
an achievable region for the degraded relay channel.

Definition A $(2^{nR}, n)$ code for a relay channel consists of a set of
integers $\mathcal{W} = \{1, 2, \ldots, 2^{nR}\}$, an encoding function

$$X : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n, \tag{15.242}$$

a set of relay functions $\{f_i\}_{i=1}^n$ such that

$$x_{1i} = f_i(Y_{11}, Y_{12}, \ldots, Y_{1,i-1}), \quad 1 \le i \le n, \tag{15.243}$$

and a decoding function,

$$g : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR}\}. \tag{15.244}$$

FIGURE 15.31. Relay channel.

Note that the definition of the encoding functions includes the
nonanticipatory condition on the relay. The relay channel input is allowed to
depend only on the past observations $y_{11}, y_{12}, \ldots, y_{1,i-1}$. The channel is
memoryless in the sense that $(Y_i, Y_{1i})$ depends on the past only through
the current transmitted symbols $(X_i, X_{1i})$. Thus, for any choice $p(w)$,
$w \in \mathcal{W}$, and code choice $X : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$ and relay functions
$\{f_i\}_{i=1}^n$, the joint probability mass function on
$\mathcal{W} \times \mathcal{X}^n \times \mathcal{X}_1^n \times \mathcal{Y}^n \times \mathcal{Y}_1^n$ is given by

$$p(w, \mathbf{x}, \mathbf{x}_1, \mathbf{y}, \mathbf{y}_1) = p(w) \prod_{i=1}^n p(x_i | w)\, p(x_{1i} | y_{11}, y_{12}, \ldots, y_{1,i-1})\, p(y_i, y_{1i} | x_i, x_{1i}). \tag{15.245}$$

If the message $w \in [1, 2^{nR}]$ is sent, let

$$\lambda(w) = \Pr\{g(\mathbf{Y}) \ne w \mid w \text{ sent}\} \tag{15.246}$$

denote the conditional probability of error. We define the average
probability of error of the code as

$$P_e^{(n)} = \frac{1}{2^{nR}} \sum_w \lambda(w). \tag{15.247}$$

The probability of error is calculated under the uniform distribution over
the codewords $w \in \{1, \ldots, 2^{nR}\}$. The rate $R$ is said to be achievable
by the relay channel if there exists a sequence of $(2^{nR}, n)$ codes with
$P_e^{(n)} \to 0$. The capacity $C$ of a relay channel is the supremum of the set
of achievable rates.

We first give an upper bound on the capacity of the relay channel. 

Theorem 15.7.1 For any relay channel
$(\mathcal{X} \times \mathcal{X}_1, p(y, y_1 | x, x_1), \mathcal{Y} \times \mathcal{Y}_1)$, the capacity $C$ is bounded above by

$$C \le \sup_{p(x, x_1)} \min\{I(X, X_1; Y),\ I(X; Y, Y_1 | X_1)\}. \tag{15.248}$$

Proof: The proof is a direct consequence of a more general max-flow
min-cut theorem given in Section 15.10. □

This upper bound has a nice max-flow min-cut interpretation. The first
term in (15.248) upper bounds the maximum rate of information transfer
from senders $X$ and $X_1$ to receiver $Y$. The second term bounds the rate
from $X$ to $Y$ and $Y_1$.
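The two cut terms in (15.248) can be evaluated for any fixed input distribution $p(x, x_1)$. The sketch below does this for a finite relay channel using only entropies of marginals. Note that it evaluates the objective at one input distribution rather than taking the supremum. The toy channel (the relay observes $x$ noiselessly and the receiver observes $x_1$ noiselessly) and all function names are hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import product
from math import log2


def marginal(joint, keep):
    """Marginalize a pmf {outcome tuple: prob} onto the coordinates in `keep`."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[k] for k in keep)] += prob
    return out


def entropy(joint, keep):
    """Entropy in bits of the coordinates listed in `keep`."""
    return -sum(p * log2(p) for p in marginal(joint, keep).values() if p > 0)


def cond_mutual_info(joint, a, b, c=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))


def cut_set_terms(p_in, channel):
    """p_in[(x, x1)] is the input pmf; channel(x, x1) returns {(y, y1): prob}."""
    joint = {}
    for (x, x1), p in p_in.items():
        for (y, y1), q in channel(x, x1).items():
            key = (x, x1, y, y1)
            joint[key] = joint.get(key, 0.0) + p * q
    X, X1, Y, Y1 = 0, 1, 2, 3
    mac_cut = cond_mutual_info(joint, [X, X1], [Y])              # I(X, X1; Y)
    broadcast_cut = cond_mutual_info(joint, [X], [Y, Y1], [X1])  # I(X; Y, Y1 | X1)
    return mac_cut, broadcast_cut


def toy_channel(x, x1):
    """Hypothetical channel: y = x1 and y1 = x, both noiseless."""
    return {(x1, x): 1.0}


uniform = {(x, x1): 0.25 for x, x1 in product((0, 1), repeat=2)}
mac_cut, broadcast_cut = cut_set_terms(uniform, toy_channel)
print("I(X,X1;Y) =", mac_cut)
print("I(X;Y,Y1|X1) =", broadcast_cut)
print("objective =", min(mac_cut, broadcast_cut))
```

For this toy channel with independent uniform inputs, both cuts carry one bit, so the objective equals one bit.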

We now consider a family of relay channels in which the relay receiver 
is better than the ultimate receiver Y in the sense defined below. Here the 
max-flow min-cut upper bound in (15.248) is achieved.

Definition The relay channel $(\mathcal{X} \times \mathcal{X}_1, p(y, y_1 | x, x_1), \mathcal{Y} \times \mathcal{Y}_1)$ is said
to be physically degraded if $p(y, y_1 | x, x_1)$ can be written in the form

$$p(y, y_1 | x, x_1) = p(y_1 | x, x_1)\, p(y | y_1, x_1). \tag{15.249}$$

Thus, $Y$ is a random degradation of the relay signal $Y_1$.

For the physically degraded relay channel, the capacity is given by the 
following theorem. 


Theorem 15.7.2 The capacity C of a physically degraded relay channel 
is given by 

$$C = \sup_{p(x, x_1)} \min\{I(X, X_1; Y),\ I(X; Y_1 | X_1)\}, \tag{15.250}$$

where the supremum is over all joint distributions on $\mathcal{X} \times \mathcal{X}_1$.

Proof:

Converse: The proof follows from Theorem 15.7.1 and by degradedness,
since for the degraded relay channel, $I(X; Y, Y_1 | X_1) = I(X; Y_1 | X_1)$.

Achievability: The proof of achievability involves a combination 
of the following basic techniques: (1) random coding, (2) list codes, 
(3) Slepian-Wolf partitioning, (4) coding for the cooperative multiple- 
access channel, (5) superposition coding, and (6) block Markov encoding 
at the relay and transmitter. We provide only an outline of the proof. 

Outline of achievability: We consider B blocks of transmission, each of 
n symbols. A sequence of $B - 1$ indices, $w_i \in \{1, \ldots, 2^{nR}\}$, $i = 1, 2, \ldots,$
$B - 1$, will be sent over the channel in $nB$ transmissions. (Note that as
$B \to \infty$ for a fixed $n$, the rate $R(B - 1)/B$ is arbitrarily close to $R$.)

We define a doubly indexed set of codewords:

$$\mathcal{C} = \{\mathbf{x}(w | s), \mathbf{x}_1(s)\} : w \in \{1, \ldots, 2^{nR}\},\ s \in \{1, \ldots, 2^{nR_0}\},\ \mathbf{x} \in \mathcal{X}^n,\ \mathbf{x}_1 \in \mathcal{X}_1^n. \tag{15.251}$$

We will also need a partition

$$\mathcal{S} = \{S_1, S_2, \ldots, S_{2^{nR_0}}\} \text{ of } \mathcal{W} = \{1, 2, \ldots, 2^{nR}\} \tag{15.252}$$

into $2^{nR_0}$ cells, with $S_i \cap S_j = \emptyset$, $i \ne j$, and $\cup S_i = \mathcal{W}$. The partition will
enable us to send side information to the receiver in the manner of Slepian
and Wolf [502].

Generation of random code: Fix $p(x_1)p(x | x_1)$.

First generate at random $2^{nR_0}$ i.i.d. $n$-sequences in $\mathcal{X}_1^n$, each
drawn according to $p(\mathbf{x}_1) = \prod_{i=1}^n p(x_{1i})$. Index them as $\mathbf{x}_1(s)$,
$s \in \{1, 2, \ldots, 2^{nR_0}\}$. For each $\mathbf{x}_1(s)$, generate $2^{nR}$ conditionally independent
$n$-sequences $\mathbf{x}(w | s)$, $w \in \{1, 2, \ldots, 2^{nR}\}$, drawn independently
according to $p(\mathbf{x} | \mathbf{x}_1(s)) = \prod_{i=1}^n p(x_i | x_{1i}(s))$. This defines the random
codebook $\mathcal{C} = \{\mathbf{x}(w | s), \mathbf{x}_1(s)\}$. The random partition
$\mathcal{S} = \{S_1, S_2, \ldots, S_{2^{nR_0}}\}$ of $\{1, 2, \ldots, 2^{nR}\}$ is defined as follows.
Let each integer $w \in \{1, 2, \ldots, 2^{nR}\}$
be assigned independently, according to a uniform distribution over the
indices $s = 1, 2, \ldots, 2^{nR_0}$, to cells $S_s$.
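The random code generation and random partition just described can be sketched for toy sizes as follows. This is only an illustration under stated assumptions: `num_w` and `num_s` stand in for $2^{nR}$ and $2^{nR_0}$, the input distributions are passed as dictionaries, and the function name and the fixed seed are hypothetical.

```python
import random


def generate_relay_codebook(n, num_w, num_s, p_x1, p_x_given_x1, seed=0):
    """Draw x1(s) i.i.d. from p_x1 and x(w|s) conditionally i.i.d. given x1(s).

    p_x1 maps relay symbols to probabilities; p_x_given_x1[a] maps sender
    symbols to probabilities given relay symbol a. num_w and num_s play the
    roles of 2^{nR} and 2^{nR_0}.
    """
    rng = random.Random(seed)

    def draw(pmf):
        symbols = list(pmf)
        return rng.choices(symbols, weights=[pmf[a] for a in symbols])[0]

    # Relay codewords x1(s), s = 1, ..., num_s.
    x1 = {s: [draw(p_x1) for _ in range(n)] for s in range(1, num_s + 1)}
    # Sender codewords x(w|s), generated symbol by symbol given x1(s).
    x = {(w, s): [draw(p_x_given_x1[x1[s][i]]) for i in range(n)]
         for s in range(1, num_s + 1) for w in range(1, num_w + 1)}
    # Random partition: each index w goes to a uniformly chosen cell S_s.
    cell_of = {w: rng.randint(1, num_s) for w in range(1, num_w + 1)}
    cells = {s: [w for w in cell_of if cell_of[w] == s]
             for s in range(1, num_s + 1)}
    return x1, x, cells


# Toy binary example with assumed distributions.
x1, x, cells = generate_relay_codebook(
    n=8, num_w=16, num_s=4,
    p_x1={0: 0.5, 1: 0.5},
    p_x_given_x1={0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
)
print({s: len(members) for s, members in cells.items()})
```

The cells are disjoint and cover all indices by construction, although individual cells may be empty at such small sizes.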

Encoding: Let $w_i \in \{1, 2, \ldots, 2^{nR}\}$ be the new index to be sent in block
$i$, and let $s_i$ be defined as the partition corresponding to $w_{i-1}$ (i.e.,
$w_{i-1} \in S_{s_i}$). The encoder sends $\mathbf{x}(w_i | s_i)$. The relay has an estimate
$\hat{w}_{i-1}$ of the previous index $w_{i-1}$. (This will be made precise in the
decoding section.) Assume that $\hat{w}_{i-1} \in S_{\hat{s}_i}$. The relay encoder sends
$\mathbf{x}_1(\hat{s}_i)$ in block $i$.

Decoding: We assume that at the end of block $i - 1$, the receiver
knows $(w_1, w_2, \ldots, w_{i-2})$ and $(s_1, s_2, \ldots, s_{i-1})$ and the relay knows
$(w_1, w_2, \ldots, w_{i-1})$ and consequently, $(s_1, s_2, \ldots, s_i)$. The decoding
procedures at the end of block $i$ are as follows:

1. Knowing $s_i$ and upon receiving $\mathbf{y}_1(i)$, the relay receiver estimates
the message of the transmitter $\hat{w}_i = w$ if and only if there exists
a unique $w$ such that $(\mathbf{x}(w | s_i), \mathbf{y}_1(i))$ are jointly $\epsilon$-typical.
Using Theorem 15.2.3, it can be shown that $\hat{w}_i = w_i$ with an
arbitrarily small probability of error if

$$R < I(X; Y_1 | X_1) \tag{15.253}$$

and $n$ is sufficiently large.

2. The receiver declares that $\hat{s}_i = s$ was sent iff there exists one and
only one $s$ such that $(\mathbf{x}_1(s), \mathbf{y}(i))$ are jointly $\epsilon$-typical. From
Theorem 15.2.1 we know that $s_i$ can be decoded with arbitrarily small
probability of error if

$$R_0 < I(X_1; Y) \tag{15.254}$$

and $n$ is sufficiently large.

3. Assuming that $s_i$ is decoded correctly at the receiver, the receiver
constructs a list $\mathcal{L}(\mathbf{y}(i - 1))$ of indices that the receiver considers to
be jointly typical with $\mathbf{y}(i - 1)$ in the $(i - 1)$th block. The receiver
then declares $\hat{w}_{i-1} = w$ as the index sent in block $i - 1$ if there is
a unique $w$ in $S_{s_i} \cap \mathcal{L}(\mathbf{y}(i - 1))$. If $n$ is sufficiently large and if

$$R < I(X; Y | X_1) + R_0, \tag{15.255}$$

then $\hat{w}_{i-1} = w_{i-1}$ with arbitrarily small probability of error.
Combining the two constraints (15.254) and (15.255), $R_0$ drops out,
leaving

$$R < I(X; Y | X_1) + I(X_1; Y) = I(X, X_1; Y). \tag{15.256}$$

For a detailed analysis of the probability of error, the reader is
referred to Cover and El Gamal [127]. □
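The block Markov structure of the outline can be made concrete with a small bookkeeping sketch: in block $i$ the sender transmits $\mathbf{x}(w_i | s_i)$ and the relay transmits $\mathbf{x}_1(s_i)$, where $s_i$ is the cell containing $w_{i-1}$. The sketch assumes that the relay has decoded every previous index correctly and uses index 1 as a fixed dummy value for $w_0$ and $w_B$; both are assumptions of this illustration, as are the function names.

```python
def block_markov_schedule(messages, cell_of):
    """List what the sender and relay transmit in each of B = len(messages) + 1 blocks.

    messages holds w_1, ..., w_{B-1}; cell_of[w] is the cell index s with w in S_s.
    Index 1 serves as a dummy for w_0 and w_B (an assumption of this sketch).
    """
    B = len(messages) + 1
    padded = [1] + list(messages) + [1]      # w_0, w_1, ..., w_{B-1}, w_B
    schedule = []
    for i in range(1, B + 1):
        s_i = cell_of[padded[i - 1]]         # cell containing w_{i-1}
        schedule.append({
            "block": i,
            "sender": ("x", padded[i], s_i),  # codeword x(w_i | s_i)
            "relay": ("x1", s_i),             # codeword x1(s_i)
        })
    return schedule


def effective_rate(R, B):
    """Rate achieved when B - 1 indices of rate R are sent in B blocks."""
    return R * (B - 1) / B


# Toy partition of {1, 2, 3, 4} into two cells (assumed for illustration).
cell_of = {1: 1, 2: 2, 3: 1, 4: 2}
for entry in block_markov_schedule([3, 2, 4], cell_of):
    print(entry)
print("effective rate:", effective_rate(1.0, 4))
```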

Theorem 15.7.2 can also be shown to give the capacity for the following
classes of relay channels:

1. Reversely degraded relay channel, that is,

$$p(y, y_1 | x, x_1) = p(y | x, x_1)\, p(y_1 | y, x_1). \tag{15.257}$$

2. Relay channel with feedback 

3. Deterministic relay channel, 

$$y_1 = f(x, x_1), \quad y = g(x, x_1). \tag{15.258}$$

15.8 SOURCE CODING WITH SIDE INFORMATION 

We now consider the distributed source coding problem where two random
variables $X$ and $Y$ are encoded separately but only $X$ is to be recovered. We
now ask how many bits $R_1$ are required to describe $X$ if we are allowed
$R_2$ bits to describe $Y$. If $R_2 > H(Y)$, then $Y$ can be described perfectly,
and by the results of Slepian-Wolf coding, $R_1 = H(X | Y)$ bits suffice
to describe $X$. At the other extreme, if $R_2 = 0$, we must describe $X$
without any help, and $R_1 = H(X)$ bits are then necessary to describe $X$. In
general, we use $R_2 = I(Y; \hat{Y})$ bits to describe an approximate version
$\hat{Y}$ of $Y$. This will allow us to describe $X$ using $H(X | \hat{Y})$ bits in the
presence of side information $\hat{Y}$. The following theorem is consistent with
this intuition.

Theorem 15.8.1 Let $(X, Y) \sim p(x, y)$. If $Y$ is encoded at rate $R_2$ and
$X$ is encoded at rate $R_1$, we can recover $X$ with an arbitrarily small
probability of error if and only if

$$R_1 \ge H(X | U), \tag{15.259}$$

$$R_2 \ge I(Y; U) \tag{15.260}$$

for some joint probability mass function $p(x, y)p(u | y)$, where
$|\mathcal{U}| \le |\mathcal{Y}| + 2$.
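As an illustration of the trade-off in Theorem 15.8.1, consider a doubly symmetric binary source: $X \sim$ Bernoulli(1/2), $Y = X \oplus Z$ with $Z \sim$ Bernoulli($p$), and the auxiliary variable $U = Y \oplus V$ with $V \sim$ Bernoulli($q$) independent of everything else. This source and this family of $U$ are assumptions chosen for the example. The sketch evaluates the rate pair for each $q$ and makes no claim that this family traces the whole region. For this choice, $H(X | U) = H(p * q)$ and $I(Y; U) = 1 - H(q)$.

```python
from math import log2


def h2(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def conv(a: float, b: float) -> float:
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1.0 - b) + (1.0 - a) * b


def helper_tradeoff(p: float, q: float):
    """Rate pair (H(X|U), I(Y;U)) for U = Y xor Bernoulli(q), Y = X xor Bernoulli(p)."""
    r1 = h2(conv(p, q))   # H(X|U): X xor U is Bernoulli(p * q), independent of U
    r2 = 1.0 - h2(q)      # I(Y;U) = H(U) - H(U|Y)
    return r1, r2


p = 0.1  # assumed correlation parameter
for q in (0.0, 0.05, 0.1, 0.25, 0.5):
    r1, r2 = helper_tradeoff(p, q)
    print(f"q={q:.2f}  R1>={r1:.4f}  R2>={r2:.4f}")
```

The endpoints match the two extremes discussed before the theorem: $q = 0$ gives $(H(X | Y), H(Y))$ and $q = \frac{1}{2}$ gives $(H(X), 0)$.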

We prove this theorem in two parts. We begin with the converse, in 
which we show that for any encoding scheme that has a small probability 
of error, we can find a random variable U with a joint probability mass 
function as in the theorem. 

Proof: (Converse). Consider any source code for Figure 15.32. The
source code consists of mappings $f_n(X^n)$ and $g_n(Y^n)$ such that the rates of
$f_n$ and $g_n$ are less than $R_1$ and $R_2$, respectively, and a decoding mapping
$h_n$ such that

$$P_e^{(n)} = \Pr\{h_n(f_n(X^n), g_n(Y^n)) \ne X^n\} < \epsilon. \tag{15.261}$$

Define new random variables $S = f_n(X^n)$ and $T = g_n(Y^n)$. Then since
we can recover $X^n$ from $S$ and $T$ with low probability of error, we have,
by Fano's inequality,

$$H(X^n | S, T) \le n\epsilon_n. \tag{15.262}$$

Then

$$nR_2 \overset{(a)}{\ge} H(T) \tag{15.263}$$

$$\overset{(b)}{\ge} I(Y^n; T) \tag{15.264}$$

$$= \sum_{i=1}^n I(Y_i; T | Y_1, \ldots, Y_{i-1}) \tag{15.265}$$

$$\overset{(c)}{=} \sum_{i=1}^n I(Y_i; T, Y_1, \ldots, Y_{i-1}) \tag{15.266}$$

$$\overset{(d)}{=} \sum_{i=1}^n I(Y_i; U_i), \tag{15.267}$$

FIGURE 15.32. Encoding with side information.

where

(a) follows from the fact that the range of $g_n$ is $\{1, 2, \ldots, 2^{nR_2}\}$

(b) follows from the properties of mutual information

(c) follows from the chain rule and the fact that $Y_i$ is independent of
$Y_1, \ldots, Y_{i-1}$ and hence $I(Y_i; Y_1, \ldots, Y_{i-1}) = 0$

(d) follows if we define $U_i = (T, Y_1, \ldots, Y_{i-1})$

We also have another chain for $R_1$,

$$nR_1 \overset{(a)}{\ge} H(S) \tag{15.268}$$

$$\overset{(b)}{\ge} H(S | T) \tag{15.269}$$

$$= H(S | T) + H(X^n | S, T) - H(X^n | S, T) \tag{15.270}$$

$$\overset{(c)}{\ge} H(X^n, S | T) - n\epsilon_n \tag{15.271}$$

$$\overset{(d)}{=} H(X^n | T) - n\epsilon_n \tag{15.272}$$

$$\overset{(e)}{=} \sum_{i=1}^n H(X_i | T, X_1, \ldots, X_{i-1}) - n\epsilon_n \tag{15.273}$$

$$\overset{(f)}{\ge} \sum_{i=1}^n H(X_i | T, X^{i-1}, Y^{i-1}) - n\epsilon_n \tag{15.274}$$

$$\overset{(g)}{=} \sum_{i=1}^n H(X_i | T, Y^{i-1}) - n\epsilon_n \tag{15.275}$$

$$\overset{(h)}{=} \sum_{i=1}^n H(X_i | U_i) - n\epsilon_n, \tag{15.276}$$

where

(a) follows from the fact that the range of $S$ is $\{1, 2, \ldots, 2^{nR_1}\}$

(b) follows from the fact that conditioning reduces entropy

(c) follows from Fano's inequality

(d) follows from the chain rule and the fact that $S$ is a function of $X^n$

(e) follows from the chain rule for entropy

(f) follows from the fact that conditioning reduces entropy

(g) follows from the (subtle) fact that $X_i \to (T, Y^{i-1}) \to X^{i-1}$ forms
a Markov chain since $X_i$ does not contain any information about
$X^{i-1}$ that is not there in $Y^{i-1}$ and $T$

(h) follows from the definition of $U$

Also, since $X_i$ contains no more information about $U_i$ than is present
in $Y_i$, it follows that $X_i \to Y_i \to U_i$ forms a Markov chain. Thus,
neglecting the term $\epsilon_n$, which vanishes as $n \to \infty$, we have
the following inequalities:

$$R_1 \ge \frac{1}{n} \sum_{i=1}^n H(X_i | U_i), \tag{15.277}$$

$$R_2 \ge \frac{1}{n} \sum_{i=1}^n I(Y_i; U_i). \tag{15.278}$$

We now introduce a timesharing random variable $Q$ so that we can rewrite
these equations as

$$R_1 \ge \frac{1}{n} \sum_{i=1}^n H(X_i | U_i, Q = i) = H(X_Q | U_Q, Q), \tag{15.279}$$

$$R_2 \ge \frac{1}{n} \sum_{i=1}^n I(Y_i; U_i | Q = i) = I(Y_Q; U_Q | Q). \tag{15.280}$$

Now since $Q$ is independent of $Y_Q$ (the distribution of $Y_i$ does not depend
on $i$), we have

$$I(Y_Q; U_Q | Q) = I(Y_Q; U_Q, Q) - I(Y_Q; Q) = I(Y_Q; U_Q, Q). \tag{15.281}$$

Now $X_Q$ and $Y_Q$ have the joint distribution $p(x, y)$ in the theorem.
Defining $U = (U_Q, Q)$, $X = X_Q$, and $Y = Y_Q$, we have shown the existence
of a random variable $U$ such that

$$R_1 \ge H(X | U), \tag{15.282}$$

$$R_2 \ge I(Y; U) \tag{15.283}$$

for any encoding scheme that has a low probability of error. Thus, the
converse is proved. □

Before we proceed to the proof of the achievability of this pair of rates, 
we will need a new lemma about strong typicality and Markov chains. 
Recall the definition of strong typicality for a triple of random variables 
$X$, $Y$, and $Z$. A triplet of sequences $x^n, y^n, z^n$ is said to be $\epsilon$-strongly
typical if

$$\left| \frac{1}{n} N(a, b, c | x^n, y^n, z^n) - p(a, b, c) \right| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}||\mathcal{Z}|}. \tag{15.284}$$

In particular, this implies that $(x^n, y^n)$ are jointly strongly typical and
that $(y^n, z^n)$ are also jointly strongly typical. But the converse is not true:
The fact that $(x^n, y^n) \in A_\epsilon^{*(n)}(X, Y)$ and $(y^n, z^n) \in A_\epsilon^{*(n)}(Y, Z)$ does not
in general imply that $(x^n, y^n, z^n) \in A_\epsilon^{*(n)}(X, Y, Z)$. But if $X \to Y \to Z$
forms a Markov chain, this implication is true. We state this as a lemma
without proof [53, 149].

Lemma 15.8.1 Let $(X, Y, Z)$ form a Markov chain $X \to Y \to Z$ [i.e.,
$p(x, y, z) = p(x, y)p(z | y)$]. If for a given $(y^n, z^n) \in A_\epsilon^{*(n)}(Y, Z)$, $X^n$ is
drawn $\sim \prod_{i=1}^n p(x_i | y_i)$, then
$\Pr\{(X^n, y^n, z^n) \in A_\epsilon^{*(n)}(X, Y, Z)\} > 1 - \epsilon$
for $n$ sufficiently large.

Remark The theorem is true from the strong law of large numbers if
$X^n \sim \prod_{i=1}^n p(x_i | y_i, z_i)$. The Markovity of $X \to Y \to Z$ is used to show
that $X^n \sim p(x_i | y_i)$ is sufficient for the same conclusion.

We now outline the proof of achievability in Theorem 15.8.1. 

Proof: (Achievability in Theorem 15.8.1). Fix $p(u | y)$. Calculate
$p(u) = \sum_y p(y)p(u | y)$.

Generation of codebooks: Generate $2^{nR_2}$ independent codewords of
length $n$, $\mathbf{U}(w_2)$, $w_2 \in \{1, 2, \ldots, 2^{nR_2}\}$ according to $\prod_{i=1}^n p(u_i)$.
Randomly bin all the $X^n$ sequences into $2^{nR_1}$ bins by independently generating
an index $b$ distributed uniformly on $\{1, 2, \ldots, 2^{nR_1}\}$ for each $X^n$. Let $B(i)$
denote the set of $X^n$ sequences allotted to bin $i$.

Encoding: The $X$ sender sends the index $i$ of the bin in which $X^n$ falls.
The $Y$ sender looks for an index $s$ such that $(Y^n, U^n(s)) \in A_\epsilon^{*(n)}(Y, U)$.
If there is more than one such $s$, it sends the least. If there is no such
$U^n(s)$ in the codebook, it sends $s = 1$.

Decoding: The receiver looks for a unique $X^n \in B(i)$ such that
$(X^n, U^n(s)) \in A_\epsilon^{*(n)}(X, U)$. If there is none or more than one, it declares an
error.

Analysis of the probability of error: The various sources of error are as
follows:

1. The pair $(X^n, Y^n)$ generated by the source is not typical. The
probability of this is small if $n$ is large. Hence, without loss of generality,
we can condition on the event that the source produces a particular
typical sequence $(x^n, y^n) \in A_\epsilon^{*(n)}$.

2. The sequence $Y^n$ is typical, but there does not exist a $U^n(s)$ in the
codebook that is jointly typical with it. The probability of this is
small from the arguments of Section 10.6, where we showed that if
there are enough codewords; that is, if

$$R_2 > I(Y; U), \tag{15.285}$$

we are very likely to find a codeword that is jointly strongly typical
with the given source sequence.

3. The codeword $U^n(s)$ is jointly typical with $y^n$ but not with $x^n$. But
by Lemma 15.8.1, the probability of this is small since $X \to Y \to U$
forms a Markov chain.

4. We also have an error if there exists another typical $X^n \in B(i)$ which
is jointly typical with $U^n(s)$. The probability that any other $X^n$ is
jointly typical with $U^n(s)$ is less than $2^{-n(I(U; X) - 3\epsilon)}$, and therefore
the probability of this kind of error is bounded above by

$$|B(i) \cap A_\epsilon^{*(n)}(X)|\, 2^{-n(I(X; U) - 3\epsilon)} \le 2^{n(H(X) + \epsilon)}\, 2^{-nR_1}\, 2^{-n(I(X; U) - 3\epsilon)}, \tag{15.286}$$

which goes to 0 if $R_1 > H(X | U)$.

Hence, it is likely that the actual source sequence $X^n$ is jointly typical
with $U^n(s)$ and that no other typical source sequence in the same bin is
also jointly typical with $U^n(s)$. We can achieve an arbitrarily low
probability of error with an appropriate choice of $n$ and $\epsilon$, and this completes
the proof of achievability. □
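The claim that the bound in (15.286) goes to 0 when $R_1 > H(X | U)$ can be checked by collecting exponents, using $H(X) - I(X; U) = H(X | U)$. The following worked step is a sketch of that bookkeeping:

```latex
\begin{align*}
2^{n(H(X)+\epsilon)}\, 2^{-nR_1}\, 2^{-n(I(X;U)-3\epsilon)}
  &= 2^{-n\left(R_1 - H(X) + I(X;U) - 4\epsilon\right)} \\
  &= 2^{-n\left(R_1 - H(X|U) - 4\epsilon\right)} .
\end{align*}
```

The exponent is negative, and the bound decays exponentially in $n$, whenever $R_1 > H(X | U) + 4\epsilon$. Since $\epsilon$ can be chosen arbitrarily small, any $R_1 > H(X | U)$ suffices.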


15.9 RATE DISTORTION WITH SIDE INFORMATION 

We know that R(D) bits are sufficient to describe X within distortion D. 
We now ask how many bits are required given side information Y. 

We begin with a few definitions. Let $(X_i, Y_i)$ be i.i.d. $\sim p(x, y)$ and
encoded as shown in Figure 15.33.

FIGURE 15.33. Rate distortion with side information.


Definition The rate distortion function with side information $R_Y(D)$
is defined as the minimum rate required to achieve distortion $D$ if the
side information $Y$ is available to the decoder. Precisely, $R_Y(D)$ is the
infimum of rates $R$ such that there exist maps $i_n : \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$,
$g_n : \mathcal{Y}^n \times \{1, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n$ such that

$$\limsup_{n \to \infty} E\, d(X^n, g_n(Y^n, i_n(X^n))) \le D. \tag{15.287}$$

Clearly, since the side information can only help, we have $R_Y(D) \le R(D)$.
For the case of zero distortion, this is the Slepian-Wolf problem
and we will need $H(X | Y)$ bits. Hence, $R_Y(0) = H(X | Y)$. We wish to
determine the entire curve $R_Y(D)$. The result can be expressed in the
following theorem.


Theorem 15.9.1 (Rate distortion with side information (Wyner and Ziv))
Let $(X, Y)$ be drawn i.i.d. $\sim p(x, y)$ and let
$d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^n d(x_i, \hat{x}_i)$ be given. The rate distortion
function with side information is

$$R_Y(D) = \min_{p(w|x)} \min_f \big(I(X; W) - I(Y; W)\big) \tag{15.288}$$

where the minimization is over all functions $f : \mathcal{Y} \times \mathcal{W} \to \hat{\mathcal{X}}$ and
conditional probability mass functions $p(w | x)$, $|\mathcal{W}| \le |\mathcal{X}| + 1$, such that

$$\sum_x \sum_w \sum_y p(x, y)\, p(w | x)\, d(x, f(y, w)) \le D. \tag{15.289}$$

The function $f$ in the theorem corresponds to the decoding map that
maps the encoded version of the $X$ symbols and the side information $Y$ to
the output alphabet. We minimize over all conditional distributions on $W$
and functions $f$ such that the expected distortion for the joint distribution
is less than $D$.
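To make the objective in (15.288) concrete, the sketch below evaluates $I(X; W) - I(Y; W)$ at one particular test channel for a doubly symmetric binary source with Hamming distortion: $X \sim$ Bernoulli(1/2), $Y = X \oplus Z$ with $Z \sim$ Bernoulli($p$), $W = X \oplus V$ with $V \sim$ Bernoulli($q$), and $f(y, w) = w$. The source, test channel, and decoder are all assumptions chosen for illustration. Because no minimization is performed, the result is only an upper bound on $R_Y(D)$ at $D = q$.

```python
from math import log2


def h2(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def conv(a: float, b: float) -> float:
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1.0 - b) + (1.0 - a) * b


def wyner_ziv_point(p: float, q: float):
    """(rate, distortion) for W = X xor Bernoulli(q) and decoder f(y, w) = w.

    I(X;W) = 1 - h2(q) and I(Y;W) = 1 - h2(p * q), so the objective is
    h2(p * q) - h2(q); the Hamming distortion of f(y, w) = w is q.
    """
    rate = h2(conv(p, q)) - h2(q)
    distortion = q
    return rate, distortion


p = 0.1  # assumed correlation between X and Y
for q in (0.0, 0.02, 0.05, 0.1):
    rate, dist = wyner_ziv_point(p, q)
    print(f"D={dist:.2f}  I(X;W)-I(Y;W)={rate:.4f}")
```

At $q = 0$ the objective equals $H(X | Y) = H(p)$, which agrees with $R_Y(0) = H(X | Y)$.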

We first prove the converse after considering some of the properties of
the function $R_Y(D)$ defined in (15.288).

Lemma 15.9.1 The rate distortion function with side information
$R_Y(D)$ defined in (15.288) is a nonincreasing convex function of $D$.

Proof: The monotonicity of $R_Y(D)$ follows immediately from the fact
that the domain of minimization in the definition of $R_Y(D)$ increases with
$D$. As in the case of rate distortion without side information, we expect
$R_Y(D)$ to be convex. However, the proof of convexity is more involved
because of the double rather than single minimization in the definition of
$R_Y(D)$ in (15.288). We outline the proof here.

Let $D_1$ and $D_2$ be two values of the distortion and let $W_1, f_1$ and
$W_2, f_2$ be the corresponding random variables and functions that achieve
the minima in the definitions of $R_Y(D_1)$ and $R_Y(D_2)$, respectively. Let
$Q$ be a random variable independent of $X$, $Y$, $W_1$, and $W_2$ which takes on
the value 1 with probability $\lambda$ and the value 2 with probability $1 - \lambda$.

Define $W = (Q, W_Q)$ and let $f(W, Y) = f_Q(W_Q, Y)$. Specifically,
$f(W, Y) = f_1(W_1, Y)$ with probability $\lambda$ and $f(W, Y) = f_2(W_2, Y)$ with
probability $1 - \lambda$. Then the distortion becomes

$$D = E\, d(X, \hat{X}) \tag{15.290}$$

$$= \lambda E\, d(X, f_1(W_1, Y)) + (1 - \lambda) E\, d(X, f_2(W_2, Y)) \tag{15.291}$$

$$= \lambda D_1 + (1 - \lambda) D_2, \tag{15.292}$$

and (15.288) becomes

$$I(W; X) - I(W; Y) = H(X) - H(X | W) - H(Y) + H(Y | W) \tag{15.293}$$

$$= H(X) - H(X | W_Q, Q) - H(Y) + H(Y | W_Q, Q) \tag{15.294}$$

$$= H(X) - \lambda H(X | W_1) - (1 - \lambda) H(X | W_2) - H(Y) + \lambda H(Y | W_1) + (1 - \lambda) H(Y | W_2) \tag{15.295}$$

$$= \lambda \big(I(W_1; X) - I(W_1; Y)\big) + (1 - \lambda) \big(I(W_2; X) - I(W_2; Y)\big), \tag{15.296}$$

and hence

$$R_Y(D) = \min_{U : E d \le D} \big(I(U; X) - I(U; Y)\big) \tag{15.297}$$

$$\le I(W; X) - I(W; Y) \tag{15.298}$$

$$= \lambda \big(I(W_1; X) - I(W_1; Y)\big) + (1 - \lambda) \big(I(W_2; X) - I(W_2; Y)\big)$$

$$= \lambda R_Y(D_1) + (1 - \lambda) R_Y(D_2), \tag{15.299}$$

proving the convexity of $R_Y(D)$. □

We are now in a position to prove the converse to the conditional rate 
distortion theorem. 

Proof: (Converse to Theorem 15.9.1). Consider any rate distortion code
with side information. Let the encoding function be
$f_n : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}$. Let the decoding function be
$g_n : \mathcal{Y}^n \times \{1, 2, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n$, and
let $g_{ni} : \mathcal{Y}^n \times \{1, 2, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}$ denote the $i$th symbol produced by the
decoding function. Let $T = f_n(X^n)$ denote the encoded version of $X^n$.
We must show that if $E\, d(X^n, g_n(Y^n, f_n(X^n))) \le D$, then $R \ge R_Y(D)$.
We have the following chain of inequalities:

$$nR \overset{(a)}{\ge} H(T) \tag{15.300}$$

$$\overset{(b)}{\ge} H(T | Y^n) \tag{15.301}$$

$$\ge I(X^n; T | Y^n) \tag{15.302}$$

$$\overset{(c)}{=} \sum_{i=1}^n I(X_i; T | Y^n, X^{i-1}) \tag{15.303}$$

$$= \sum_{i=1}^n H(X_i | Y^n, X^{i-1}) - H(X_i | T, Y^n, X^{i-1}) \tag{15.304}$$

$$\overset{(d)}{=} \sum_{i=1}^n H(X_i | Y_i) - H(X_i | T, Y^{i-1}, Y_i, Y_{i+1}^n, X^{i-1}) \tag{15.305}$$

$$\overset{(e)}{\ge} \sum_{i=1}^n H(X_i | Y_i) - H(X_i | T, Y^{i-1}, Y_i, Y_{i+1}^n) \tag{15.306}$$

$$\overset{(f)}{=} \sum_{i=1}^n H(X_i | Y_i) - H(X_i | W_i, Y_i) \tag{15.307}$$

$$\overset{(g)}{=} \sum_{i=1}^n I(X_i; W_i | Y_i) \tag{15.308}$$

$$= \sum_{i=1}^n H(W_i | Y_i) - H(W_i | X_i, Y_i) \tag{15.309}$$

$$\overset{(h)}{=} \sum_{i=1}^n H(W_i | Y_i) - H(W_i | X_i) \tag{15.310}$$

$$= \sum_{i=1}^n H(W_i) - H(W_i | X_i) - H(W_i) + H(W_i | Y_i) \tag{15.311}$$

$$= \sum_{i=1}^n I(W_i; X_i) - I(W_i; Y_i) \tag{15.312}$$

$$\overset{(i)}{\ge} \sum_{i=1}^n R_Y\big(E\, d(X_i, g'_{ni}(W_i, Y_i))\big) \tag{15.313}$$

$$= n \frac{1}{n} \sum_{i=1}^n R_Y\big(E\, d(X_i, g'_{ni}(W_i, Y_i))\big) \tag{15.314}$$

$$\overset{(j)}{\ge} n R_Y\left(\frac{1}{n} \sum_{i=1}^n E\, d(X_i, g'_{ni}(W_i, Y_i))\right) \tag{15.315}$$

$$\overset{(k)}{\ge} n R_Y(D), \tag{15.316}$$

where

(a) follows from the fact that the range of $T$ is $\{1, 2, \ldots, 2^{nR}\}$

(b) follows from the fact that conditioning reduces entropy

(c) follows from the chain rule for mutual information

(d) follows from the fact that $X_i$ is independent of the past and future
$Y$'s and $X$'s given $Y_i$

(e) follows from the fact that conditioning reduces entropy

(f) follows by defining $W_i = (T, Y^{i-1}, Y_{i+1}^n)$

(g) follows from the definition of mutual information

(h) follows from the fact that since $Y_i$ depends only on $X_i$ and is
conditionally independent of $T$ and the past and future $Y$'s,
$W_i \to X_i \to Y_i$ forms a Markov chain

(i) follows from the definition of the (information) conditional
rate distortion function since $\hat{X}_i = g_{ni}(T, Y^n) = g'_{ni}(W_i, Y_i)$,
and hence $I(W_i; X_i) - I(W_i; Y_i) \ge \min_{W : E d(X, \hat{X}) \le D_i} I(W; X) - I(W; Y) = R_Y(D_i)$

(j) follows from Jensen's inequality and the convexity of the conditional
rate distortion function (Lemma 15.9.1)

(k) follows from the definition of $D = E\left[\frac{1}{n} \sum_{i=1}^n d(X_i, \hat{X}_i)\right]$ □

It is easy to see the parallels between this converse and the converse 
for rate distortion without side information (Section 10.4). The proof of 
achievability is also parallel to the proof of the rate distortion theorem 
using strong typicality. However, instead of sending the index of the 
codeword that is jointly typical with the source, we divide these codewords 
into bins and send the bin index instead. If the number of codewords in 
each bin is small enough, the side information can be used to isolate 
the particular codeword in the bin at the receiver. Hence again we are 
combining random binning with rate distortion encoding to find a jointly 
typical reproduction codeword. We outline the details of the proof below. 

Proof: (Achievability of Theorem 15.9.1). Fix $p(w | x)$ and the function
$f(w, y)$. Calculate $p(w) = \sum_x p(x)p(w | x)$.

Generation of codebook: Let $R_1 = I(X; W) + \epsilon$. Generate $2^{nR_1}$ i.i.d.
codewords $W^n(s) \sim \prod_{i=1}^n p(w_i)$, and index them by $s \in \{1, 2, \ldots, 2^{nR_1}\}$.
Let $R_2 = I(X; W) - I(Y; W) + 5\epsilon$. Randomly assign the indices
$s \in \{1, 2, \ldots, 2^{nR_1}\}$ to one of $2^{nR_2}$ bins using a uniform distribution over
the bins. Let $B(i)$ denote the indices assigned to bin $i$. There are
approximately $2^{n(R_1 - R_2)}$ indices in each bin.

Encoding: Given a source sequence $X^n$, the encoder looks for a
codeword $W^n(s)$ such that $(X^n, W^n(s)) \in A_\epsilon^{*(n)}$. If there is no such $W^n$, the
encoder sets $s = 1$. If there is more than one such $s$, the encoder uses the
lowest $s$. The encoder sends the index of the bin in which $s$ belongs.

Decoding: The decoder looks for a $W^n(s)$ such that $s \in B(i)$ and
$(W^n(s), Y^n) \in A_\epsilon^{*(n)}$. If he finds a unique $s$, he then calculates $\hat{X}^n$, where
$\hat{X}_i = f(W_i, Y_i)$. If he does not find any such $s$ or more than one such $s$,
he sets $\hat{X}^n = \hat{x}^n$, where $\hat{x}^n$ is an arbitrary sequence in $\hat{\mathcal{X}}^n$. It does not
matter which default sequence is used; we will show that the probability
of this event is small.
Analysis of the probability of error: As usual, we have various error 
events: 

1. The pair $(X^n, Y^n) \notin A_\epsilon^{*(n)}$. The probability of this event is small for
large enough $n$ by the weak law of large numbers.

2. The sequence $X^n$ is typical, but there does not exist an $s$ such that
$(X^n, W^n(s)) \in A_\epsilon^{*(n)}$. As in the proof of the rate distortion theorem,
the probability of this event is small if

$$R_1 > I(W; X). \tag{15.317}$$

3. The pair of sequences $(X^n, W^n(s)) \in A_\epsilon^{*(n)}$ but $(W^n(s), Y^n) \notin A_\epsilon^{*(n)}$
(i.e., the codeword is not jointly typical with the $Y^n$ sequence). By
the Markov lemma (Lemma 15.8.1), the probability of this event is
small if $n$ is large enough.

4. There exists another $s'$ with the same bin index such that
$(W^n(s'), Y^n) \in A_\epsilon^{*(n)}$. Since the probability that a randomly chosen $W^n$ is
jointly typical with $Y^n$ is $\approx 2^{-nI(Y; W)}$, the probability that there is
another $W^n$ in the same bin that is typical with $Y^n$ is bounded by
the number of codewords in the bin times the probability of joint
typicality, that is,

$$\Pr\big(\exists s' \in B(i) : (W^n(s'), Y^n) \in A_\epsilon^{*(n)}\big) \le 2^{n(R_1 - R_2)}\, 2^{-n(I(W; Y) - 3\epsilon)}, \tag{15.318}$$

which goes to zero since $R_1 - R_2 < I(Y; W) - 3\epsilon$.

5. If the index $s$ is decoded correctly, $(X^n, W^n(s)) \in A_\epsilon^{*(n)}$. By item 1
we can assume that $(X^n, Y^n) \in A_\epsilon^{*(n)}$. Thus, by the Markov lemma,
we have $(X^n, Y^n, W^n) \in A_\epsilon^{*(n)}$ and therefore the empirical joint
distribution is close to the original distribution $p(x, y)p(w | x)$ that we
started with, and hence $(X^n, \hat{X}^n)$ will have a joint distribution that
is close to the distribution that achieves distortion $D$.

Hence with high probability, the decoder will produce $\hat{X}^n$ such that the
distortion between $X^n$ and $\hat{X}^n$ is close to $nD$. This completes the proof
of the theorem. □
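The condition used in error event 4 follows directly from the rates chosen in the codebook generation step, $R_1 = I(X; W) + \epsilon$ and $R_2 = I(X; W) - I(Y; W) + 5\epsilon$. A short worked check of the bin-size exponent:

```latex
\begin{align*}
R_1 - R_2 &= \big(I(X;W) + \epsilon\big) - \big(I(X;W) - I(Y;W) + 5\epsilon\big)
           = I(Y;W) - 4\epsilon \;<\; I(Y;W) - 3\epsilon, \\
2^{n(R_1 - R_2)}\, 2^{-n(I(W;Y) - 3\epsilon)} &= 2^{-n\epsilon} \longrightarrow 0
           \quad \text{as } n \to \infty .
\end{align*}
```

So each bin holds about $2^{n(I(Y;W) - 4\epsilon)}$ codeword indices, few enough that the side information $Y^n$ isolates the correct one with high probability.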

The reader is referred to Wyner and Ziv [574] for details of the proof.
After the discussion of the various situations of compressing distributed
data, it might be expected that the problem is almost completely solved,
but unfortunately, this is not true. An immediate generalization of all
the above problems is the rate distortion problem for correlated sources,
illustrated in Figure 15.34. This is essentially the Slepian-Wolf problem
with distortion in both $X$ and $Y$. It is easy to see that the three
distributed source coding problems considered above are all special cases
of this setup. Unlike the earlier problems, though, this problem has not
yet been solved and the general rate distortion region remains
unknown.

FIGURE 15.34. Rate distortion for two correlated sources.

15.10 GENERAL MULTITERMINAL NETWORKS 

We conclude this chapter by considering a general multiterminal network 
of senders and receivers and deriving some bounds on the rates achievable 
for communication in such a network. A general multiterminal network 
is illustrated in Figure 15.35. In this section, superscripts denote node 
indices and subscripts denote time indices. There are m nodes, and node 
i has an associated transmitted variable and a received variable Y^ l \ 



FIGURE 15.35. General multiterminal network. 
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The node i sends information at rate 7 ? (⑺ to node j. We assume that all 
the messages VK(") being sent from node i to node j are independent and 
uniformly distributed over their respective ranges {1,2 ,..., 2 nR(，j) }. 

The channel is represented by the channel transition function 
p(y^ l \ •. • ， y( w )|x( 1 )，• • • ， x( w ))，which is the conditional probability mass 
function of the outputs given the inputs. This probability transition func¬ 
tion captures the effects of the noise and the interference in the network. 
The channel is assumed to be memoryless (i.e.，the outputs at any time 
instant depend only the current inputs and are conditionally independent 
of the past inputs). 

Corresponding to each transmitter-receiver node pair is a message
$W^{(ij)} \in \{1, 2, \ldots, 2^{nR^{(ij)}}\}$. The input symbol $X^{(i)}$ at node i depends on
$W^{(ij)}$, $j \in \{1, \ldots, m\}$, and also on the past values of the received symbol
$Y^{(i)}$ at node i. Hence, an encoding scheme of block length n consists of
a set of encoding and decoding functions, one for each node:

• Encoders: $X_k^{(i)}(W^{(i1)}, W^{(i2)}, \ldots, W^{(im)}, Y_1^{(i)}, Y_2^{(i)}, \ldots, Y_{k-1}^{(i)})$, $k = 1, \ldots, n$.
The encoder maps the messages and past received symbols
into the symbol $X_k^{(i)}$ transmitted at time k.

• Decoders: $\hat{W}^{(ji)}(Y_1^{(i)}, \ldots, Y_n^{(i)}, W^{(i1)}, \ldots, W^{(im)})$, $j = 1, 2, \ldots, m$.
The decoder j at node i maps the received symbols in each block and
his own transmitted information to form estimates of the messages
intended for him from node j, $j = 1, 2, \ldots, m$.

Associated with every pair of nodes is a rate and a corresponding
probability of error that the message will not be decoded correctly,

$$P_e^{(n)(ij)} = \Pr\left(\hat{W}^{(ij)}\left(\mathbf{Y}^{(j)}, W^{(j1)}, \ldots, W^{(jm)}\right) \neq W^{(ij)}\right), \tag{15.319}$$

where $P_e^{(n)(ij)}$ is defined under the assumption that all the messages are
independent and distributed uniformly over their respective ranges.

A set of rates $\{R^{(ij)}\}$ is said to be achievable if there exist encoders and
decoders with block length n with $P_e^{(n)(ij)} \to 0$ as $n \to \infty$ for all
$i, j \in \{1, 2, \ldots, m\}$. We use this formulation to derive an upper bound on the
flow of information in any multiterminal network. We divide the nodes
into two sets, $S$ and the complement $S^c$. We now bound the rate of flow
of information from nodes in $S$ to nodes in $S^c$. See [514].

Theorem 15.10.1 If the information rates $\{R^{(ij)}\}$ are achievable, there
exists some joint probability distribution $p(x^{(1)}, x^{(2)}, \ldots, x^{(m)})$ such that

$$\sum_{i \in S,\, j \in S^c} R^{(ij)} \le I\left(X^{(S)}; Y^{(S^c)} \mid X^{(S^c)}\right) \tag{15.320}$$

for all $S \subset \{1, 2, \ldots, m\}$. Thus, the total rate of flow of information across
cut sets is bounded by the conditional mutual information.


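Proof: The proof follows the same lines as the proof of the converse
for the multiple access channel. Let $T = \{(i, j) : i \in S, j \in S^c\}$ be the set
of links that cross from $S$ to $S^c$, and let $T^c$ be all the other links in the
network. Then

\begin{align}
& n \sum_{i \in S,\, j \in S^c} R^{(ij)} \tag{15.321} \\
& \overset{(a)}{=} \sum_{i \in S,\, j \in S^c} H\left(W^{(ij)}\right) \tag{15.322} \\
& \overset{(b)}{=} H\left(W^{(T)}\right) \tag{15.323} \\
& \overset{(c)}{=} H\left(W^{(T)} \mid W^{(T^c)}\right) \tag{15.324} \\
& = I\left(W^{(T)}; Y_1^{(S^c)}, \ldots, Y_n^{(S^c)} \mid W^{(T^c)}\right) \tag{15.325} \\
& \quad + H\left(W^{(T)} \mid Y_1^{(S^c)}, \ldots, Y_n^{(S^c)}, W^{(T^c)}\right) \tag{15.326} \\
& \overset{(d)}{\le} I\left(W^{(T)}; Y_1^{(S^c)}, \ldots, Y_n^{(S^c)} \mid W^{(T^c)}\right) + n\epsilon_n \tag{15.327} \\
& \overset{(e)}{=} \sum_{k=1}^{n} I\left(W^{(T)}; Y_k^{(S^c)} \mid Y_1^{(S^c)}, \ldots, Y_{k-1}^{(S^c)}, W^{(T^c)}\right) + n\epsilon_n \tag{15.328} \\
& \overset{(f)}{=} \sum_{k=1}^{n} H\left(Y_k^{(S^c)} \mid Y_1^{(S^c)}, \ldots, Y_{k-1}^{(S^c)}, W^{(T^c)}\right) \\
& \quad - H\left(Y_k^{(S^c)} \mid Y_1^{(S^c)}, \ldots, Y_{k-1}^{(S^c)}, W^{(T^c)}, W^{(T)}\right) + n\epsilon_n \tag{15.329} \\
& \overset{(g)}{\le} \sum_{k=1}^{n} H\left(Y_k^{(S^c)} \mid Y_1^{(S^c)}, \ldots, Y_{k-1}^{(S^c)}, W^{(T^c)}, X_k^{(S^c)}\right) \\
& \quad - H\left(Y_k^{(S^c)} \mid Y_1^{(S^c)}, \ldots, Y_{k-1}^{(S^c)}, W^{(T^c)}, W^{(T)}, X_k^{(S)}, X_k^{(S^c)}\right) + n\epsilon_n \tag{15.330} \\
& \overset{(h)}{\le} \sum_{k=1}^{n} H\left(Y_k^{(S^c)} \mid X_k^{(S^c)}\right) - H\left(Y_k^{(S^c)} \mid X_k^{(S^c)}, X_k^{(S)}\right) + n\epsilon_n \tag{15.331} \\
& = \sum_{k=1}^{n} I\left(X_k^{(S)}; Y_k^{(S^c)} \mid X_k^{(S^c)}\right) + n\epsilon_n \tag{15.332} \\
& \overset{(i)}{=} n \frac{1}{n} \sum_{k=1}^{n} I\left(X_Q^{(S)}; Y_Q^{(S^c)} \mid X_Q^{(S^c)}, Q = k\right) + n\epsilon_n \tag{15.333} \\
& \overset{(j)}{=} n I\left(X_Q^{(S)}; Y_Q^{(S^c)} \mid X_Q^{(S^c)}, Q\right) + n\epsilon_n \tag{15.334} \\
& = n \left(H\left(Y_Q^{(S^c)} \mid X_Q^{(S^c)}, Q\right) - H\left(Y_Q^{(S^c)} \mid X_Q^{(S)}, X_Q^{(S^c)}, Q\right)\right) + n\epsilon_n \tag{15.335} \\
& \overset{(k)}{\le} n \left(H\left(Y_Q^{(S^c)} \mid X_Q^{(S^c)}\right) - H\left(Y_Q^{(S^c)} \mid X_Q^{(S)}, X_Q^{(S^c)}, Q\right)\right) + n\epsilon_n \tag{15.336} \\
& \overset{(l)}{=} n \left(H\left(Y_Q^{(S^c)} \mid X_Q^{(S^c)}\right) - H\left(Y_Q^{(S^c)} \mid X_Q^{(S)}, X_Q^{(S^c)}\right)\right) + n\epsilon_n \tag{15.337} \\
& = n I\left(X_Q^{(S)}; Y_Q^{(S^c)} \mid X_Q^{(S^c)}\right) + n\epsilon_n, \tag{15.338}
\end{align}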

where

(a) follows from the fact that the messages $W^{(ij)}$ are uniformly
distributed over their respective ranges $\{1, 2, \ldots, 2^{nR^{(ij)}}\}$

(b) follows from the definition of $W^{(T)} = \{W^{(ij)} : i \in S, j \in S^c\}$ and
the fact that the messages are independent

(c) follows from the independence of the messages for $T$ and $T^c$

(d) follows from Fano's inequality since the messages $W^{(T)}$ can be
decoded from $Y^{(S^c)}$ and $W^{(T^c)}$

(e) is the chain rule for mutual information

(f) follows from the definition of mutual information

(g) follows from the fact that $X_k^{(S^c)}$ is a function of the past received
symbols $Y^{(S^c)}$ and the messages $W^{(T^c)}$ and the fact that adding
conditioning reduces the second term

(h) follows from the fact that $Y_k^{(S^c)}$ depends only on the current input
symbols $X_k^{(S)}$ and $X_k^{(S^c)}$

(i) follows after we introduce a new timesharing random variable Q
distributed uniformly on $\{1, 2, \ldots, n\}$

(j) follows from the definition of mutual information

(k) follows from the fact that conditioning reduces entropy

(l) follows from the fact that $Y_Q^{(S^c)}$ depends only on the inputs $X_Q^{(S)}$ and
$X_Q^{(S^c)}$ and is conditionally independent of Q

Thus, there exist random variables $X^{(S)}$ and $X^{(S^c)}$ with some arbitrary
joint distribution that satisfy the inequalities of the theorem. □

The theorem has a simple max-flow min-cut interpretation. The rate of
flow of information across any boundary is less than the mutual information
between the inputs on one side of the boundary and the outputs on
the other side, conditioned on the inputs on the other side.

The problem of information flow in networks would be solved if the 
bounds of the theorem were achievable. But unfortunately, these bounds 
are not achievable even for some simple channels. We now apply these 
bounds to a few of the channels that we considered earlier. 

• Multiple-access channel. The multiple access channel is a network 
with many input nodes and one output node. For the case of a two-user 
multiple-access channel, the bounds of Theorem 15.10.1 reduce to 

$$R_1 \le I(X_1; Y \mid X_2), \tag{15.339}$$

$$R_2 \le I(X_2; Y \mid X_1), \tag{15.340}$$

$$R_1 + R_2 \le I(X_1, X_2; Y) \tag{15.341}$$

for some joint distribution $p(x_1, x_2)p(y \mid x_1, x_2)$. These bounds
coincide with the capacity region if we restrict the input distribution to
be a product distribution and take the convex hull (Theorem 15.3.1).

• Relay channel. For the relay channel, these bounds give the upper 
bound of Theorem 15.7.1 with different choices of subsets as shown 
in Figure 15.36. Thus, 

$$C \le \sup_{p(x, x_1)} \min\{I(X, X_1; Y),\ I(X; Y, Y_1 \mid X_1)\}. \tag{15.342}$$

This upper bound is the capacity of a physically degraded relay channel
and for the relay channel with feedback [127].
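
As a numerical illustration of the multiple-access bounds (15.339)-(15.341), the following Python sketch evaluates the three mutual informations for the binary erasure multiple-access channel Y = X_1 + X_2. The function name `cut_set_bounds` and the dictionary encoding of distributions are choices made here for illustration and do not come from the text. Under these assumptions, independent uniform inputs give a sum bound of 1.5 bits, while a correlated input distribution raises it to log 3 bits, which suggests why the cut-set bound matches the capacity region only after restricting to product distributions.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits of a pmf stored as a dict of probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint pmf over (x1, x2, y) onto the coordinates in idx."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0.0) + p
    return out

def cut_set_bounds(p_inputs, channel):
    """Return (I(X1;Y|X2), I(X2;Y|X1), I(X1,X2;Y)) in bits.

    p_inputs: dict mapping (x1, x2) to probability (hypothetical encoding).
    channel:  function (x1, x2) -> dict mapping y to p(y | x1, x2).
    """
    joint = {}
    for (x1, x2), px in p_inputs.items():
        for y, py in channel(x1, x2).items():
            joint[(x1, x2, y)] = joint.get((x1, x2, y), 0.0) + px * py

    def h(*idx):
        return entropy(marginal(joint, idx))

    # Each conditional mutual information is written as a sum of entropies.
    i1 = h(0, 1) + h(1, 2) - h(1) - h(0, 1, 2)
    i2 = h(0, 1) + h(0, 2) - h(0) - h(0, 1, 2)
    i12 = h(0, 1) + h(2) - h(0, 1, 2)
    return i1, i2, i12

def erasure_mac(x1, x2):
    """Binary erasure multiple-access channel: Y = X1 + X2, noiseless."""
    return {x1 + x2: 1.0}

product_inputs = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
correlated_inputs = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 1): 1 / 3}

product_bounds = cut_set_bounds(product_inputs, erasure_mac)
correlated_bounds = cut_set_bounds(correlated_inputs, erasure_mac)
```

Comparing `product_bounds[2]` with `correlated_bounds[2]` shows the gap between the sum-rate bound under a product distribution and under a general joint distribution for this channel.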

FIGURE 15.37. Transmission of correlated sources over a multiple-access channel.


To complement our discussion of a general network, we should mention 
two features of single-user channels that do not apply to a multiuser 
network. 

• Source-channel separation theorem. In Section 7.13 we discussed 
the source-channel separation theorem, which proves that we can 
transmit the source noiselessly over the channel if and only if the 
entropy rate is less than the channel capacity. This allows us to
characterize a source by a single number (the entropy rate) and the channel
by a single number (the capacity). What about the multiuser case? 
We would expect that a distributed source could be transmitted over 
a channel if and only if the rate region for the noiseless coding of the 
source lay within the capacity region of the channel. To be specific, 
consider the transmission of a distributed source over a multiple- 
access channel, as shown in Figure 15.37. Combining the results of 
Slepian-Wolf encoding with the capacity results for the multiple- 
access channel, we can show that we can transmit the source over 
the channel and recover it with a low probability of error if 


$$H(U \mid V) < I(X_1; Y \mid X_2, Q), \tag{15.343}$$

$$H(V \mid U) < I(X_2; Y \mid X_1, Q), \tag{15.344}$$

$$H(U, V) < I(X_1, X_2; Y \mid Q) \tag{15.345}$$

for some distribution $p(q)p(x_1 \mid q)p(x_2 \mid q)p(y \mid x_1, x_2)$. This condition
is equivalent to saying that the Slepian-Wolf rate region of the source
has a nonempty intersection with the capacity region of the multiple-access
channel.

But is this condition also necessary? No, as a simple example illustrates.
Consider the transmission of the source of Example 15.4.2
over the binary erasure multiple-access channel (Example 15.3.3).
The Slepian-Wolf region does not intersect the capacity region, yet
it is simple to devise a scheme that allows the source to be transmitted
over the channel. We just let $X_1 = U$ and $X_2 = V$, and the value
of Y will tell us the pair (U, V) with no error. Thus, the conditions
(15.343)-(15.345) are not necessary.
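
The counterexample can be checked numerically. The sketch below assumes that the source of Example 15.4.2 is uniform over the three pairs (0,0), (0,1), and (1,1); that joint distribution is an assumption made here, since the example itself is not reproduced in this section. The names `source_pmf` and `uncoded_decoder` are hypothetical. Under this assumption, H(U, V) exceeds the largest sum rate of the binary erasure multiple-access channel with independent inputs, yet sending the source symbols uncoded lets the receiver read (U, V) off Y = X_1 + X_2.

```python
from math import log2

# Assumed joint pmf of (U, V): uniform on three of the four binary pairs.
source_pmf = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 1): 1 / 3}

def entropy(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def uncoded_decoder(pmf):
    """Send X1 = U, X2 = V over Y = X1 + X2.

    Returns the map y -> (u, v) if every output identifies a unique
    source pair of positive probability, and None otherwise.
    """
    table = {}
    for (u, v), p in pmf.items():
        if p == 0:
            continue
        y = u + v
        if y in table:
            return None  # two source pairs collide on the same output
        table[y] = (u, v)
    return table

# H(U, V) for the assumed source.
joint_entropy = entropy(source_pmf.values())

# Sum rate of the erasure channel with independent uniform inputs:
# Y takes values 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
max_sum_rate = entropy([0.25, 0.5, 0.25])

decoder = uncoded_decoder(source_pmf)
```

Here `joint_entropy > max_sum_rate`, so separate Slepian-Wolf and channel coding cannot work for this pair, while `decoder` is a valid lookup table for the uncoded scheme.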

The reason for the failure of the source-channel separation theorem
lies in the fact that the capacity of the multiple-access channel
increases with the correlation between the inputs of the channel.
Therefore, to maximize the capacity, one should preserve the
correlation between the inputs of the channel. Slepian-Wolf encoding,
on the other hand, gets rid of the correlation. Cover et al. [129]
proposed an achievable region for transmission of a correlated source
over a multiple access channel based on the idea of preserving the
correlation. Han and Costa [273] have proposed a similar region for
the transmission of a correlated source over a broadcast channel.

• Capacity regions with feedback. Theorem 7.12.1 shows that feedback 
does not increase the capacity of a single-user discrete memoryless 
channel. For channels with memory, on the other hand, feedback 
enables the sender to predict something about the noise and to combat 
it more effectively, thus increasing capacity. 

What about multiuser channels? Rather surprisingly, feedback does 
increase the capacity region of multiuser channels, even when the 
channels are memoryless. This was first shown by Gaarder and Wolf 
[220], who showed how feedback helps increase the capacity of the 
binary erasure multiple-access channel. In essence, feedback from the 
receiver to the two senders acts as a separate channel between the two 
senders. The senders can decode each other’s transmissions before the 
receiver does. They then cooperate to resolve the uncertainty at the 
receiver, sending information at the higher cooperative capacity rather 
than the noncooperative capacity. Using this scheme, Cover and 
Leung [133] established an achievable region for a multiple-access
channel with feedback. Willems [557] showed that this region was
the capacity for a class of multiple-access channels that included the
binary erasure multiple-access channel. Ozarow [410] established the
capacity region for a two-user Gaussian multiple-access channel with
feedback. The problem of finding the capacity region for a multiple-access
channel with feedback is closely related to the capacity of a two-way
channel with a common output.

There is as yet no unified theory of network information flow. But there 
can be no doubt that a complete theory of communication networks would 
have wide implications for the theory of communication and computation. 


SUMMARY 

Multiple-access channel. The capacity of a multiple-access channel
$(\mathcal{X}_1 \times \mathcal{X}_2, p(y \mid x_1, x_2), \mathcal{Y})$ is the closure of the convex hull of all $(R_1, R_2)$
satisfying

$$R_1 < I(X_1; Y \mid X_2), \tag{15.346}$$

$$R_2 < I(X_2; Y \mid X_1), \tag{15.347}$$

$$R_1 + R_2 < I(X_1, X_2; Y) \tag{15.348}$$

for some distribution $p_1(x_1)p_2(x_2)$ on $\mathcal{X}_1 \times \mathcal{X}_2$.

The capacity region of the m-user multiple-access channel is the closure
of the convex hull of the rate vectors satisfying

$$R(S) \le I(X(S); Y \mid X(S^c)) \quad \text{for all } S \subseteq \{1, 2, \ldots, m\} \tag{15.349}$$

for some product distribution $p_1(x_1)p_2(x_2) \cdots p_m(x_m)$.

Gaussian multiple-access channel. The capacity region of a two-user
Gaussian multiple-access channel is

$$R_1 \le C\left(\frac{P_1}{N}\right), \tag{15.350}$$

$$R_2 \le C\left(\frac{P_2}{N}\right), \tag{15.351}$$

$$R_1 + R_2 \le C\left(\frac{P_1 + P_2}{N}\right), \tag{15.352}$$

where

$$C(x) = \frac{1}{2} \log(1 + x). \tag{15.353}$$

Slepian-Wolf coding. Correlated sources X and Y can be described
separately at rates $R_1$ and $R_2$ and recovered with arbitrarily low
probability of error by a common decoder if and only if

$$R_1 \ge H(X \mid Y), \tag{15.354}$$

$$R_2 \ge H(Y \mid X), \tag{15.355}$$

$$R_1 + R_2 \ge H(X, Y). \tag{15.356}$$
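
The three Slepian-Wolf constraints are easy to evaluate for a small joint distribution. The following Python sketch is only illustrative: the helper names `slepian_wolf_bounds` and `in_region`, and the doubly symmetric binary source with crossover probability 0.1 used as the example, are assumptions introduced here rather than material from the text.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits of a pmf stored as a dict of probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def slepian_wolf_bounds(joint):
    """Return (H(X|Y), H(Y|X), H(X,Y)) in bits for a joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    h_xy = entropy(joint)
    return h_xy - entropy(py), h_xy - entropy(px), h_xy

def in_region(r1, r2, joint):
    """Check the three inequalities (15.354)-(15.356) for a rate pair."""
    h_x_given_y, h_y_given_x, h_xy = slepian_wolf_bounds(joint)
    return r1 >= h_x_given_y and r2 >= h_y_given_x and r1 + r2 >= h_xy

# Hypothetical example: X is a fair bit and Y is X flipped with probability 0.1.
dsbs = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
bounds = slepian_wolf_bounds(dsbs)
```

For this example source, the pair (1.0, 0.5) satisfies all three constraints, whereas (0.7, 0.7) meets the individual constraints but violates the sum-rate constraint.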

Broadcast channels. The capacity region of the degraded broadcast
channel $X \to Y_1 \to Y_2$ is the convex hull of the closure of all $(R_1, R_2)$
satisfying

$$R_2 \le I(U; Y_2), \tag{15.357}$$

$$R_1 \le I(X; Y_1 \mid U) \tag{15.358}$$

for some joint distribution $p(u)p(x \mid u)p(y_1, y_2 \mid x)$.

Relay channel. The capacity C of the physically degraded relay channel
$p(y, y_1 \mid x, x_1)$ is given by

$$C = \sup_{p(x, x_1)} \min\{I(X, X_1; Y),\ I(X; Y_1 \mid X_1)\}, \tag{15.359}$$

where the supremum is over all joint distributions on $\mathcal{X} \times \mathcal{X}_1$.

Source coding with side information. Let $(X, Y) \sim p(x, y)$. If Y is
encoded at rate $R_2$ and X is encoded at rate $R_1$, we can recover X with
an arbitrarily small probability of error iff

$$R_1 \ge H(X \mid U), \tag{15.360}$$

$$R_2 \ge I(Y; U) \tag{15.361}$$

for some distribution $p(y, u)$ such that $X \to Y \to U$.

Rate distortion with side information. Let $(X, Y) \sim p(x, y)$. The
rate distortion function with side information is given by

$$R_Y(D) = \min_{p(w \mid x)}\ \min_{f : \mathcal{Y} \times \mathcal{W} \to \hat{\mathcal{X}}} I(X; W) - I(Y; W), \tag{15.362}$$

where the minimization is over all functions $f$ and conditional
distributions $p(w \mid x)$, $|\mathcal{W}| \le |\mathcal{X}| + 1$, such that

$$\sum_{x} \sum_{w} \sum_{y} p(x, y)p(w \mid x)d(x, f(y, w)) \le D. \tag{15.363}$$

PROBLEMS 

15.1 Cooperative capacity of a multiple-access channel

[Figure: multiple-access channel $p(y \mid x_1, x_2)$ whose two encoders both see $(W_1, W_2)$.]

(a) Suppose that $X_1$ and $X_2$ have access to both indices $W_1 \in
\{1, 2^{nR_1}\}$, $W_2 \in \{1, 2^{nR_2}\}$. Thus, the codewords $X_1(W_1,
W_2)$, $X_2(W_1, W_2)$ depend on both indices. Find the capacity
region.

(b) Evaluate this region for the binary erasure multiple access
channel $Y = X_1 + X_2$, $X_i \in \{0, 1\}$. Compare to the
noncooperative region.

15.2 Capacity of multiple-access channels. Find the capacity region
for each of the following multiple-access channels:

(a) Additive modulo 2 multiple-access channel. $X_1 \in \{0, 1\}$,
$X_2 \in \{0, 1\}$, $Y = X_1 \oplus X_2$.

(b) Multiplicative multiple-access channel. $X_1 \in \{-1, 1\}$,
$X_2 \in \{-1, 1\}$, $Y = X_1 \cdot X_2$.






15.3 Cut-set interpretation of capacity region of multiple-access channel.
For the multiple-access channel we know that $(R_1, R_2)$ is
achievable if

$$R_1 < I(X_1; Y \mid X_2), \tag{15.364}$$

$$R_2 < I(X_2; Y \mid X_1), \tag{15.365}$$

$$R_1 + R_2 < I(X_1, X_2; Y) \tag{15.366}$$

for $X_1$, $X_2$ independent. Show, for $X_1$, $X_2$ independent, that

$$I(X_1; Y \mid X_2) = I(X_1; Y, X_2).$$

Interpret the information bounds as bounds on the rate of flow
across cut sets $S_1$, $S_2$, and $S_3$.

15.4 Gaussian multiple-access channel capacity. For the AWGN
multiple-access channel, prove, using typical sequences, the
achievability of any rate pairs $(R_1, R_2)$ satisfying

$$R_1 < \frac{1}{2} \log\left(1 + \frac{P_1}{N}\right), \tag{15.367}$$

$$R_2 < \frac{1}{2} \log\left(1 + \frac{P_2}{N}\right), \tag{15.368}$$

$$R_1 + R_2 < \frac{1}{2} \log\left(1 + \frac{P_1 + P_2}{N}\right). \tag{15.369}$$

The proof extends the proof for the discrete multiple-access channel
in the same way as the proof for the single-user Gaussian
channel extends the proof for the discrete single-user channel.


15.5 Converse for the Gaussian multiple-access channel. Prove the
converse for the Gaussian multiple-access channel by extending
the converse in the discrete case to take into account the power
constraint on the codewords.

15.6 Unusual multiple-access channel. Consider the following
multiple-access channel: $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{Y} = \{0, 1\}$. If $(X_1, X_2) =
(0, 0)$, then $Y = 0$. If $(X_1, X_2) = (0, 1)$, then $Y = 1$. If $(X_1, X_2) =
(1, 0)$, then $Y = 1$. If $(X_1, X_2) = (1, 1)$, then $Y = 0$ with
probability $\frac{1}{2}$ and $Y = 1$ with probability $\frac{1}{2}$.

(a) Show that the rate pairs (1,0) and (0,1) are achievable.

(b) Show that for any nondegenerate distribution $p(x_1)p(x_2)$, we
have $I(X_1, X_2; Y) < 1$.

(c) Argue that there are points in the capacity region of this
multiple-access channel that can only be achieved by
timesharing; that is, there exist achievable rate pairs $(R_1, R_2)$ that
lie in the capacity region for the channel but not in the region
defined by

$$R_1 \le I(X_1; Y \mid X_2), \tag{15.370}$$

$$R_2 \le I(X_2; Y \mid X_1), \tag{15.371}$$

$$R_1 + R_2 \le I(X_1, X_2; Y) \tag{15.372}$$

for any product distribution $p(x_1)p(x_2)$. Hence the operation
of convexification strictly enlarges the capacity region. This
channel was introduced independently by Csiszár and Körner
[149] and Bierbaum and Wallmeier [59].

15.7 Convexity of capacity region of broadcast channel. Let $\mathbf{C} \subseteq \mathbf{R}^2$
be the capacity region of all achievable rate pairs $\mathbf{R} = (R_1, R_2)$
for the broadcast channel. Show that $\mathbf{C}$ is a convex set by using
a time-sharing argument. Specifically, show that if $\mathbf{R}^{(1)}$ and $\mathbf{R}^{(2)}$
are achievable, $\lambda \mathbf{R}^{(1)} + (1 - \lambda)\mathbf{R}^{(2)}$ is achievable for $0 \le \lambda \le 1$.

15.8 Slepian-Wolf for deterministically related sources. Find and
sketch the Slepian-Wolf rate region for the simultaneous data
compression of (X, Y), where $y = f(x)$ is some deterministic
function of x.

15.9 Slepian-Wolf. Let $X_i$ be i.i.d. Bernoulli(p). Let $Z_i$ be i.i.d. $\sim$
Bernoulli(r), and let $\mathbf{Z}$ be independent of $\mathbf{X}$. Finally, let $\mathbf{Y} =
\mathbf{X} \oplus \mathbf{Z}$ (mod 2 addition). Let $\mathbf{X}$ be described at rate $R_1$ and $\mathbf{Y}$
be described at rate $R_2$. What region of rates allows recovery of
$\mathbf{X}$, $\mathbf{Y}$ with probability of error tending to zero?

15.10 Broadcast capacity depends only on the conditional marginals.
Consider the general broadcast channel $(\mathcal{X}, \mathcal{Y}_1 \times \mathcal{Y}_2, p(y_1, y_2 \mid x))$.
Show that the capacity region depends only on $p(y_1 \mid x)$ and
$p(y_2 \mid x)$. To do this, for any given $((2^{nR_1}, 2^{nR_2}), n)$ code, let

$$P_1^{(n)} = P\{\hat{W}_1(\mathbf{Y}_1) \neq W_1\}, \tag{15.373}$$

$$P_2^{(n)} = P\{\hat{W}_2(\mathbf{Y}_2) \neq W_2\}, \tag{15.374}$$

$$P^{(n)} = P\{(\hat{W}_1, \hat{W}_2) \neq (W_1, W_2)\}. \tag{15.375}$$

Then show that

$$\max\{P_1^{(n)}, P_2^{(n)}\} \le P^{(n)} \le P_1^{(n)} + P_2^{(n)}.$$

The result now follows by a simple argument. (Remark: The
probability of error $P^{(n)}$ does depend on the conditional joint
distribution $p(y_1, y_2 \mid x)$. But whether or not $P^{(n)}$ can be driven
to zero [at rates $(R_1, R_2)$] does not [except through the conditional
marginals $p(y_1 \mid x)$, $p(y_2 \mid x)$].)

15.11 Converse for the degraded broadcast channel. The following
chain of inequalities proves the converse for the degraded
discrete memoryless broadcast channel. Provide reasons for each of
the labeled inequalities.

Setup for converse for degraded broadcast channel capacity:

$$(W_1, W_2)_{\text{indep.}} \to X^n(W_1, W_2) \to Y_1^n \to Y_2^n.$$

• Encoding $f_n : 2^{nR_1} \times 2^{nR_2} \to \mathcal{X}^n$

• Decoding: $g_n : \mathcal{Y}_1^n \to 2^{nR_1}$, $h_n : \mathcal{Y}_2^n \to 2^{nR_2}$. Let $U_i =
(W_2, Y_1^{i-1})$. Then

\begin{align}
nR_2 & \le_{\text{Fano}} I(W_2; Y_2^n) \tag{15.376} \\
& \overset{(a)}{=} \sum_{i=1}^{n} I(W_2; Y_{2i} \mid Y_2^{i-1}) \tag{15.377} \\
& \overset{(b)}{=} \sum_{i} \left(H(Y_{2i} \mid Y_2^{i-1}) - H(Y_{2i} \mid W_2, Y_2^{i-1})\right) \tag{15.378} \\
& \overset{(c)}{\le} \sum_{i} \left(H(Y_{2i}) - H(Y_{2i} \mid W_2, Y_2^{i-1}, Y_1^{i-1})\right) \tag{15.379} \\
& \overset{(d)}{=} \sum_{i} \left(H(Y_{2i}) - H(Y_{2i} \mid W_2, Y_1^{i-1})\right) \tag{15.380} \\
& \overset{(e)}{=} \sum_{i=1}^{n} I(U_i; Y_{2i}). \tag{15.381}
\end{align}

Continuation of converse: Give reasons for the labeled inequalities:

\begin{align}
nR_1 & \le_{\text{Fano}} I(W_1; Y_1^n) \tag{15.382} \\
& \overset{(f)}{\le} I(W_1; Y_1^n, W_2) \tag{15.383} \\
& \overset{(g)}{\le} I(W_1; Y_1^n \mid W_2) \tag{15.384} \\
& \overset{(h)}{=} \sum_{i=1}^{n} I(W_1; Y_{1i} \mid Y_1^{i-1}, W_2) \tag{15.385} \\
& \overset{(i)}{\le} \sum_{i=1}^{n} I(X_i; Y_{1i} \mid U_i). \tag{15.386}
\end{align}

Now let Q be a time-sharing random variable with $\Pr(Q = i) =
1/n$, $i = 1, 2, \ldots, n$. Justify the following:

$$R_1 \le I(X_Q; Y_{1Q} \mid U_Q, Q), \tag{15.387}$$

$$R_2 \le I(U_Q; Y_{2Q} \mid Q) \tag{15.388}$$

for some distribution $p(q)p(u \mid q)p(x \mid u, q)p(y_1, y_2 \mid x)$. By
appropriately redefining U, argue that this region is equal to the convex
closure of regions of the form

$$R_1 \le I(X; Y_1 \mid U), \tag{15.389}$$

$$R_2 \le I(U; Y_2) \tag{15.390}$$

for some joint distribution $p(u)p(x \mid u)p(y_1, y_2 \mid x)$.

15.12 Capacity points.

(a) For the degraded broadcast channel $X \to Y_1 \to Y_2$, find the
points a and b where the capacity region hits the $R_1$ and $R_2$
axes.

(b) Show that $b \le a$.

15.13 Degraded broadcast channel. Find the capacity region for the
degraded broadcast channel shown below.

[Figure: degraded broadcast channel diagram, with a transition labeled 1 - p.]

15.14 Channels with unknown parameters. We are given a binary
symmetric channel with parameter p. The capacity is $C = 1 -
H(p)$. Now we change the problem slightly. The receiver knows
only that $p \in \{p_1, p_2\}$ (i.e., $p = p_1$ or $p = p_2$, where $p_1$ and $p_2$
are given real numbers). The transmitter knows the actual value
of p. Devise two codes for use by the transmitter, one to be used
if $p = p_1$, the other to be used if $p = p_2$, such that transmission
to the receiver can take place at rate $\approx C(p_1)$ if $p = p_1$ and at
rate $\approx C(p_2)$ if $p = p_2$. (Hint: Devise a method for revealing
p to the receiver without affecting the asymptotic rate. Prefixing
the codeword by a sequence of 1's of appropriate length should
work.)

15.15 Two-way channel. Consider the two-way channel shown in
Figure 15.6. The outputs $Y_1$ and $Y_2$ depend only on the current
inputs $X_1$ and $X_2$.

(a) By using independently generated codes for the two senders,
show that the following rate region is achievable:

$$R_1 < I(X_1; Y_2 \mid X_2), \tag{15.391}$$

$$R_2 < I(X_2; Y_1 \mid X_1) \tag{15.392}$$

for some product distribution $p(x_1)p(x_2)p(y_1, y_2 \mid x_1, x_2)$.

(b) Show that the rates for any code for a two-way channel with
arbitrarily small probability of error must satisfy

$$R_1 \le I(X_1; Y_2 \mid X_2), \tag{15.393}$$

$$R_2 \le I(X_2; Y_1 \mid X_1) \tag{15.394}$$

for some joint distribution $p(x_1, x_2)p(y_1, y_2 \mid x_1, x_2)$.

The inner and outer bounds on the capacity of the two-way
channel are due to Shannon [486]. He also showed that the inner
bound and the outer bound do not coincide in the case of the
binary multiplying channel $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{Y}_1 = \mathcal{Y}_2 = \{0, 1\}$, $Y_1 =
Y_2 = X_1 X_2$. The capacity of the two-way channel is still an open
problem.

15.16 Multiple-access channel. Let the output Y of a multiple-access
channel be given by

$$Y = X_1 + \operatorname{sgn}(X_2),$$

where $X_1$, $X_2$ are both real and power limited,

$$E(X_1^2) \le P_1, \qquad E(X_2^2) \le P_2,$$

and

$$\operatorname{sgn}(x) = \begin{cases} 1, & x > 0, \\ -1, & x \le 0. \end{cases}$$

Note that there is interference but no noise in this channel.

(a) Find the capacity region.

(b) Describe a coding scheme that achieves the capacity region.

15.17 Slepian-Wolf. Let (X, Y) have the joint probability mass
function p(x, y):

    p(x, y)    y = 1   y = 2   y = 3
    x = 1        α       β       β
    x = 2        β       α       β
    x = 3        β       β       α

where $\beta = \frac{1}{6} - \frac{\alpha}{2}$. (Note: This is a joint, not a conditional,
probability mass function.)

(a) Find the Slepian-Wolf rate region for this source.

(b) What is Pr{X = Y} in terms of α?

(c) What is the rate region if $\alpha = \frac{1}{3}$?

(d) What is the rate region if $\alpha = \frac{1}{9}$?

15.18 Square channel. What is the capacity of the following
multiple-access channel?

$$X_1 \in \{-1, 0, 1\},$$

$$X_2 \in \{-1, 0, 1\},$$

$$Y = X_1^2 + X_2^2.$$

(a) Find the capacity region.

(b) Describe $p^*(x_1)$, $p^*(x_2)$ achieving a point on the boundary of
the capacity region.

15.19 Slepian-Wolf. Two senders know random variables $U_1$ and $U_2$,
respectively. Let the random variables $(U_1, U_2)$ have the following
joint distribution:

    U_1 \ U_2    0          1          2          ...   m-1
    0            α          β/(m-1)    β/(m-1)    ...   β/(m-1)
    1            γ/(m-1)    0          0          ...   0
    2            γ/(m-1)    0          0          ...   0
    ...          ...        ...        ...        ...   ...
    m-1          γ/(m-1)    0          0          ...   0

where $\alpha + \beta + \gamma = 1$. Find the region of rates $(R_1, R_2)$ that would
allow a common receiver to decode both random variables reliably.

15.20 Multiple access

(a) Find the capacity region for the multiple-access channel

$$Y = X_1^{X_2},$$

where

$$X_1 \in \{2, 4\}, \quad X_2 \in \{1, 2\}.$$

(b) Suppose that the range of $X_1$ is {1, 2}. Is the capacity region
decreased? Why or why not?

15.21 Broadcast channel. Consider the following degraded broadcast
channel.

(a) What is the capacity of the channel from X to $Y_1$?

(b) What is the channel capacity from X to $Y_2$?

(c) What is the capacity region of all $(R_1, R_2)$ achievable for this
broadcast channel? Simplify and sketch.

15.22 Stereo. The sum and the difference of the right and left ear
signals are to be individually compressed for a common receiver. Let
$Z_1$ be Bernoulli($p_1$) and $Z_2$ be Bernoulli($p_2$) and suppose that
$Z_1$ and $Z_2$ are independent. Let $X = Z_1 + Z_2$, and $Y = Z_1 - Z_2$.

(a) What is the Slepian-Wolf rate region of achievable $(R_X, R_Y)$?

[Figure: X and Y are encoded separately; a common decoder outputs $(\hat{X}, \hat{Y})$.]

(b) Is this larger or smaller than the rate region of $(R_{Z_1}, R_{Z_2})$?
Why?

There is a simple way to do this part.

15.23 Multiplicative multiple-access channel. Find and sketch the
capacity region of the following multiplicative multiple-access
channel:

[Figure: inputs $X_1$ and $X_2$ feed a multiplier whose output is Y.]

with $X_1 \in \{0, 1\}$, $X_2 \in \{1, 2, 3\}$, and $Y = X_1 X_2$.

15.24 Distributed data compression. Let $Z_1$, $Z_2$, $Z_3$ be independent
Bernoulli(p). Find the Slepian-Wolf rate region for the description
of $(X_1, X_2, X_3)$, where

$$X_1 = Z_1$$
$$X_2 = Z_1 + Z_2$$
$$X_3 = Z_1 + Z_2 + Z_3.$$

15.25 Noiseless multiple-access channel. Consider the following
multiple-access channel with two binary inputs $X_1, X_2 \in \{0, 1\}$
and output $Y = (X_1, X_2)$.

(a) Find the capacity region. Note that each sender can send at
capacity.

(b) Now consider the cooperative capacity region, $R_1 \ge 0$, $R_2 \ge
0$, $R_1 + R_2 \le \max_{p(x_1, x_2)} I(X_1, X_2; Y)$. Argue that the
throughput $R_1 + R_2$ does not increase but the capacity region
increases.

15.26 Infinite bandwidth multiple-access channel. Find the capacity
region for the Gaussian multiple-access channel with infinite
bandwidth. Argue that all senders can send at their individual capacities
(i.e., infinite bandwidth eliminates interference).

15.27 Multiple-access identity. Let $C(x) = \frac{1}{2} \log(1 + x)$ denote the
channel capacity of a Gaussian channel with signal-to-noise ratio
x. Show that

$$C\left(\frac{P_1}{N}\right) + C\left(\frac{P_2}{P_1 + N}\right) = C\left(\frac{P_1 + P_2}{N}\right).$$

This suggests that two independent users can send information as
well as if they had pooled their power.
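
A quick numerical check can build confidence in the identity before attempting the proof. The Python sketch below is only a spot check at arbitrarily chosen powers, not a derivation; the function names `gaussian_capacity` and `two_stage_sum` are introduced here for illustration.

```python
from math import log2

def gaussian_capacity(snr):
    """C(x) = (1/2) log2(1 + x), in bits per transmission."""
    return 0.5 * log2(1 + snr)

def two_stage_sum(p1, p2, n):
    """Left-hand side of the identity: C(P1/N) + C(P2/(P1 + N))."""
    return gaussian_capacity(p1 / n) + gaussian_capacity(p2 / (p1 + n))

def pooled_power(p1, p2, n):
    """Right-hand side of the identity: C((P1 + P2)/N)."""
    return gaussian_capacity((p1 + p2) / n)

# Arbitrary test values for P1, P2, and N (assumptions for this check).
lhs = two_stage_sum(3.0, 5.0, 2.0)
rhs = pooled_power(3.0, 5.0, 2.0)
```

The two values agree up to floating-point rounding because (1 + P_1/N)(1 + P_2/(P_1 + N)) = 1 + (P_1 + P_2)/N.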

15.28 Frequency-division multiple access (FDMA). Maximize the
throughput $R_1 + R_2 = W_1 \log\left(1 + \frac{P_1}{N W_1}\right) + (W - W_1) \log\left(1 +
\frac{P_2}{N(W - W_1)}\right)$ over $W_1$ to show that bandwidth should be proportional
to transmitted power for FDMA.
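
The claim can be explored numerically before it is proved. The following Python sketch performs a plain grid search over the bandwidth split; the function names, the grid resolution, and the example values W = 1, P_1 = 1, P_2 = 3, N = 1 are all assumptions chosen here for illustration.

```python
from math import log2

def fdma_throughput(w1, w, p1, p2, n):
    """R1 + R2 when user 1 gets bandwidth w1 and user 2 gets w - w1."""
    w2 = w - w1
    return w1 * log2(1 + p1 / (n * w1)) + w2 * log2(1 + p2 / (n * w2))

def best_split(w, p1, p2, n, steps=10000):
    """Grid search for the bandwidth w1 in (0, w) that maximizes throughput."""
    candidates = [w * k / steps for k in range(1, steps)]
    return max(candidates, key=lambda w1: fdma_throughput(w1, w, p1, p2, n))

# Example values (assumed): total bandwidth 1, powers 1 and 3, noise density 1.
w1_star = best_split(1.0, 1.0, 3.0, 1.0)
proportional_share = 1.0 * 1.0 / (1.0 + 3.0)
```

For these values the grid maximizer lands at the proportional share W P_1/(P_1 + P_2), up to the grid spacing.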

15.29 Trilingual-speaker broadcast channel. A speaker of Dutch,
Spanish, and French wishes to communicate simultaneously to
three people: D, S, and F. D knows only Dutch but can
distinguish when a Spanish word is being spoken as distinguished from
a French word; similarly for the other two, who know only
Spanish and French, respectively, but can distinguish when a foreign
word is spoken and which language is being spoken. Suppose
that each language, Dutch, Spanish, and French, has M words: M
words of Dutch, M words of French, and M words of Spanish.

(a) What is the maximum rate at which the trilingual speaker can
speak to D?

(b) If he speaks to D at the maximum rate, what is the maximum
rate at which he can speak simultaneously to S?

(c) If he is speaking to D and S at the joint rate in part (b), can
he also speak to F at some positive rate? If so, what is it? If
not, why not?

15.30 Parallel Gaussian channels from a mobile telephone. Assume
that a sender X is sending to two fixed base stations. Assume that
the sender sends a signal X that is constrained to have average
power P. Assume that the two base stations receive signals $Y_1$
and $Y_2$, where

$$Y_1 = \alpha_1 X + Z_1$$
$$Y_2 = \alpha_2 X + Z_2,$$

where $Z_1 \sim \mathcal{N}(0, N_1)$, $Z_2 \sim \mathcal{N}(0, N_2)$, and $Z_1$ and $Z_2$ are
independent. We will assume the α's are constant over a transmitted
block.

(a) Assuming that both signals $Y_1$ and $Y_2$ are available at a
common decoder $Y = (Y_1, Y_2)$, what is the capacity of the channel
from the sender to the common receiver?

(b) If, instead, the two receivers $Y_1$ and $Y_2$ each decode their
signals independently, this becomes a broadcast channel. Let $R_1$
be the rate to base station 1 and $R_2$ be the rate to base station
2. Find the capacity region of this channel.

15.31 Gaussian multiple access. A group of m users, each with power
P, is using a Gaussian multiple-access channel at capacity, so that

$$\sum_{i=1}^{m} R_i = C\left(\frac{mP}{N}\right), \tag{15.395}$$

where $C(x) = \frac{1}{2} \log(1 + x)$ and N is the receiver noise power. A
new user of power $P_0$ wishes to join in.

(a) At what rate can he send without disturbing the other users?

(b) What should his power $P_0$ be so that the new user's rate is
equal to the combined communication rate $C(mP/N)$ of all
the other users?

15.32 Converse for deterministic broadcast channel. A deterministic
broadcast channel is defined by an input X and two outputs, $Y_1$
and $Y_2$, which are functions of the input X. Thus, $Y_1 = f_1(X)$ and
$Y_2 = f_2(X)$. Let $R_1$ and $R_2$ be the rates at which information can
be sent to the two receivers. Prove that

$$R_1 \le H(Y_1) \tag{15.396}$$

$$R_2 \le H(Y_2) \tag{15.397}$$

$$R_1 + R_2 \le H(Y_1, Y_2). \tag{15.398}$$

15.33 Multiple-access channel. Consider the multiple-access channel
$Y = X_1 + X_2 \pmod 4$, where $X_1 \in \{0, 1, 2, 3\}$, $X_2 \in \{0, 1\}$.

(a) Find the capacity region $(R_1, R_2)$.

(b) What is the maximum throughput $R_1 + R_2$?

15.34 Distributed source compression. Let

$$Z_1 = \begin{cases} 1, & p \\ 0, & q, \end{cases} \qquad Z_2 = \begin{cases} 1, & p \\ 0, & q, \end{cases}$$

and let $U = Z_1 Z_2$, $V = Z_1 + Z_2$. Assume that $Z_1$ and $Z_2$ are
independent. This induces a joint distribution on (U, V). Let
$(U_i, V_i)$ be i.i.d. according to this distribution. Sender 1 describes
$U^n$ at rate $R_1$, and sender 2 describes $V^n$ at rate $R_2$.

(a) Find the Slepian-Wolf rate region for recovering $(U^n, V^n)$
at the receiver.

(b) What is the residual uncertainty (conditional entropy) that
the receiver has about $(X^n, Y^n)$?


15.35 Multiple-access channel capacity with costs. The cost of using
symbol x is r(x). The cost of a codeword $x^n$ is $r(x^n) =
\frac{1}{n} \sum_{i=1}^{n} r(x_i)$. A $(2^{nR}, n)$ codebook satisfies cost constraint r if
$\frac{1}{n} \sum_{i=1}^{n} r(x_i(w)) \le r$ for all $w \in 2^{nR}$.

(a) Find an expression for the capacity C(r) of a discrete
memoryless channel with cost constraint r.

(b) Find an expression for the multiple-access channel capacity
region for $(\mathcal{X}_1 \times \mathcal{X}_2, p(y \mid x_1, x_2), \mathcal{Y})$ if sender $X_1$ has cost
constraint $r_1$ and sender $X_2$ has cost constraint $r_2$.

(c) Prove the converse for part (b).

15.36 Slepian-Wolf. Three cards from a three-card deck are dealt, one
to sender $X_1$, one to sender $X_2$, and one to sender $X_3$. At what
rates do $X_1$, $X_2$, and $X_3$ need to communicate to some receiver
so that their card information can be recovered?

[Figure: $X_1^n$, $X_2^n$, and $X_3^n$ are encoded separately as $i(X_1^n)$, $j(X_2^n)$, and $k(X_3^n)$; a common decoder outputs $(\hat{X}_1^n, \hat{X}_2^n, \hat{X}_3^n)$.]

Assume that $(X_{1i}, X_{2i}, X_{3i})$ are drawn i.i.d. from a uniform
distribution over the permutations of {1, 2, 3}.


HISTORICAL NOTES 

This chapter is based on the review in El Gamal and Cover [186]. The 
two-way channel was studied by Shannon [486] in 1961. He derived inner 
and outer bounds on the capacity region. Dueck [175] and Schalkwijk 
[464, 465] suggested coding schemes for two-way channels that achieve 
rates exceeding Shannon’s inner bound; outer bounds for this channel 
were derived by Zhang et al. [596] and Willems and Hekstra [558]. 

The multiple-access channel capacity region was found by Ahlswede 
[7] and Liao [355] and was extended to the case of the multiple-access 
channel with common information by Slepian and Wolf [501]. Gaarder 
and Wolf [220] were the first to show that feedback increases the capac¬ 
ity of a discrete memoryless multiple-access channel. Cover and Leung 
[133] proposed an achievable region for the multiple-access channel with 
feedback, which was shown to be optimal for a class of multiple-access 
channels by Willems [557]. Ozarow [410] has determined the capacity 
region for a two-user Gaussian multiple-access channel with feedback. 
Cover et al. [129] and Ahlswede and Han [12] have considered the prob¬ 
lem of transmission of a correlated source over a multiple-access channel. 
The Slepian-Wolf theorem was proved by Slepian and Wolf [502] and was 
extended to jointly ergodic sources by a binning argument in Cover [122]. 

Superposition coding for broadcast channels was suggested by Cover
in 1972 [119]. The capacity region for the degraded broadcast channel
was determined by Bergmans [55] and Gallager [225]. The superposition
codes for the degraded broadcast channel are also optimal for the
less noisy broadcast channel (Körner and Marton [324]), the more capable
broadcast channel (El Gamal [185]), and the broadcast channel with
degraded message sets (Körner and Marton [325]). Van der Meulen [526]
and Cover [121] proposed achievable regions for the general broadcast
channel. The capacity of a deterministic broadcast channel was found by
Gelfand and Pinsker [242, 243, 423] and Marton [377]. The best known
achievable region for the broadcast channel is due to Marton [377]; a simpler
proof of Marton’s region was given by El Gamal and Van der Meulen
[188]. El Gamal [184] showed that feedback does not increase the capacity
of a physically degraded broadcast channel. Dueck [176] introduced
an example to illustrate that feedback can increase the capacity of a
memoryless broadcast channel; Ozarow and Leung [411] described a coding
procedure for the Gaussian broadcast channel with feedback that increased
the capacity region.

The relay channel was introduced by Van der Meulen [528]; the capacity
region for the degraded relay channel was determined by Cover and
El Gamal [127]. Carleial [85] introduced the Gaussian interference channel
with power constraints and showed that very strong interference is
equivalent to no interference at all. Sato and Tanabe [459] extended the 
work of Carleial to discrete interference channels with strong interference. 
Sato [457] and Benzel [51] dealt with degraded interference channels. The 
best known achievable region for the general interference channel is due 
to Han and Kobayashi [274]. This region gives the capacity for Gaussian 
interference channels with interference parameters greater than 1, as was 
shown in Han and Kobayashi [274] and Sato [458]. Carleial [84] proved 
new bounds on the capacity region for interference channels. 

The problem of coding with side information was introduced by Wyner 
and Ziv [573] and Wyner [570]; the achievable region for this problem 
was described in Ahlswede and Körner [13], Gray and Wyner [261], and
Wyner [571, 572]. The problem of finding the rate distortion function
with side information was solved by Wyner and Ziv [574]. The channel 
capacity counterpart of rate distortion with side information was solved by 
Gelfand and Pinsker [243]; the duality between the two results is explored 
in Cover and Chiang [113]. The problem of multiple descriptions is treated 
in El Gamal and Cover [187]. 

The special problem of encoding a function of two random variables 
was discussed by Körner and Marton [326], who described a simple
method to encode the modulo 2 sum of two binary random variables.
A general framework for the description of source networks may be
found in Csiszár and Körner [148, 149]. A common model that includes
Slepian-Wolf encoding, coding with side information, and rate distortion
with side information as special cases was described by Berger and
Yeung [54].

In 1989, Ahlswede and Dueck [17] introduced the problem of identification
via communication channels, which can be viewed as a problem
where the sender sends information to the receivers but each receiver only
needs to know whether or not a single message was sent. In this case, the
set of possible messages that can be sent reliably is doubly exponential in
the block length, and the key result of this paper was to show that $2^{2^{nC}}$
messages could be identified for any noisy channel with capacity C. This
problem spawned a set of papers [16, 18, 269, 434], including extensions
to channels with feedback and multiuser channels.

Another active area of work has been the analysis of MIMO (multiple-input
multiple-output) systems or space-time coding, which use multiple
antennas at the transmitter and receiver to take advantage of the diversity
gains from multipath for wireless systems. The analysis of these multiple
antenna systems by Foschini [217], Telatar [512], and Rayleigh and Cioffi
[246] shows that the capacity gains from the diversity obtained using multiple
antennas in fading environments can be substantial relative to the
single-user capacity achieved by traditional equalization and interleaving
techniques. A special issue of the IEEE Transactions on Information
Theory [70] has a number of papers covering different aspects of this
technology.

Comprehensive surveys of network information theory may be found 
in El Gamal and Cover [186], Van der Meulen [526-528], Berger [53],
Csiszár and Körner [149], Verdú [538], Cover [111], and Ephremides and
Hajek [197]. 



CHAPTER 16 

INFORMATION THEORY 
AND PORTFOLIO THEORY 


The duality between the growth rate of wealth in the stock market and 
the entropy rate of the market is striking. In particular, we shall find the 
competitively optimal and growth rate optimal portfolio strategies. They 
are the same, just as the Shannon code is optimal both competitively and 
in the expected description rate. We also find the asymptotic growth rate 
of wealth for an ergodic stock market process. We end with a discussion 
of universal portfolios that enable one to achieve the same asymptotic 
growth rate as the best constant rebalanced portfolio in hindsight. 

In Section 16.8 we provide a “sandwich” proof of the asymptotic 
equipartition property for general ergodic processes that is motivated by 
the notion of optimal portfolios for stationary ergodic stock markets. 


16.1 THE STOCK MARKET: SOME DEFINITIONS 


A stock market is represented as a vector of stocks $\mathbf{X} = (X_1, X_2, \ldots, X_m)$,
$X_i \ge 0$, $i = 1, 2, \ldots, m$, where m is the number of stocks and the price
relative $X_i$ is the ratio of the price at the end of the day to the price at the
beginning of the day. So typically, $X_i$ is near 1. For example, $X_i = 1.03$
means that the ith stock went up 3 percent that day.

Let $\mathbf{X} \sim F(\mathbf{x})$, where $F(\mathbf{x})$ is the joint distribution of the vector of
price relatives. A portfolio $\mathbf{b} = (b_1, b_2, \ldots, b_m)$, $b_i \ge 0$, $\sum_i b_i = 1$, is an
allocation of wealth across the stocks. Here $b_i$ is the fraction of one’s
wealth invested in stock i. If one uses a portfolio b and the stock vector
is X, the wealth relative (ratio of the wealth at the end of the day to the
wealth at the beginning of the day) is $S = \mathbf{b}^t \mathbf{X} = \sum_{i=1}^{m} b_i X_i$.

We wish to maximize S in some sense. But S is a random variable,
the distribution of which depends on portfolio b, so there is controversy
over the choice of the best distribution for S. The standard theory of
stock market investment is based on consideration of the first and second
moments of S. The objective is to maximize the expected value of S
subject to a constraint on the variance. Since it is easy to calculate these
moments, the theory is simpler than the theory that deals with the entire
distribution of S.

FIGURE 16.1. Sharpe-Markowitz theory: set of achievable mean-variance pairs.

The mean-variance approach is the basis of the Sharpe-Markowitz
theory of investment in the stock market and is used by business analysts
and others. It is illustrated in Figure 16.1. The figure illustrates the
set of achievable mean-variance pairs using various portfolios. The set
of portfolios on the boundary of this region corresponds to the undominated
portfolios: These are the portfolios that have the highest mean for
a given variance. This boundary is called the efficient frontier, and if one
is interested only in mean and variance, one should operate along this
boundary.

Normally, the theory is simplified with the introduction of a risk-free 
asset (e.g., cash or Treasury bonds, which provide a fixed interest rate 
with zero variance). This stock corresponds to a point on the Y axis 
in the figure. By combining the risk-free asset with various stocks, one 
obtains all points below the tangent from the risk-free asset to the efficient 
frontier. This line now becomes part of the efficient frontier. 

The concept of the efficient frontier also implies that there is a true 
price for a stock corresponding to its risk. This theory of stock prices, 
called the capital asset pricing model (CAPM), is used to decide whether 
the market price for a stock is too high or too low. Looking at the mean 
of a random variable gives information about the long-term behavior of 
the sum of i.i.d. versions of the random variable. But in the stock market,
one normally reinvests every day, so that the wealth at the end of n days 
is the product of factors, one for each day of the market. The behavior of 
the product is determined not by the expected value but by the expected 
logarithm. This leads us to define the growth rate as follows: 

Definition The growth rate of a stock market portfolio b with respect
to a stock distribution F(x) is defined as

$$W(\mathbf{b}, F) = \int \log \mathbf{b}^t \mathbf{x} \, dF(\mathbf{x}) = E\left(\log \mathbf{b}^t \mathbf{X}\right). \tag{16.1}$$

If the logarithm is to base 2, the growth rate is also called the doubling
rate.

Definition The optimal growth rate W*(F) is defined as

$$W^*(F) = \max_{\mathbf{b}} W(\mathbf{b}, F), \tag{16.2}$$

where the maximum is over all possible portfolios $b_i \ge 0$, $\sum_i b_i = 1$.

Definition A portfolio b* that achieves the maximum of W(b, F) is
called a log-optimal portfolio or growth optimal portfolio.

The definition of growth rate is justified by the following theorem,
which shows that wealth grows as $2^{nW^*}$.

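Theorem 16.1.1 Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be i.i.d. according to F(x). Let

$$S_n^* = \prod_{i=1}^{n} \mathbf{b}^{*t} \mathbf{X}_i \tag{16.3}$$

be the wealth after n days using the constant rebalanced portfolio b*. Then

$$\frac{1}{n} \log S_n^* \to W^* \quad \text{with probability 1.} \tag{16.4}$$

Proof: By the strong law of large numbers,

$$\frac{1}{n} \log S_n^* = \frac{1}{n} \sum_{i=1}^{n} \log \mathbf{b}^{*t} \mathbf{X}_i \tag{16.5}$$

$$\to W^* \quad \text{with probability 1.} \tag{16.6}$$

Hence, $S_n^* \doteq 2^{nW^*}$. □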
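As a rough numerical illustration of Theorem 16.1.1, the sketch below computes the doubling rate for a small discrete market, finds the log-optimal portfolio by grid search, and compares the optimal doubling rate with a simulated value of (1/n) log S_n^*. The market (cash plus a stock that doubles or halves with equal probability), the function names, and the grid-search method are all assumptions made for this example; they are not taken from the text, and grid search is only practical for two stocks.

```python
import math
import random

# Hypothetical market (not from the text): stock 1 is cash, stock 2
# doubles or halves with equal probability. Entries are (x, probability).
OUTCOMES = [((1.0, 2.0), 0.5), ((1.0, 0.5), 0.5)]


def wealth_relative(b, x):
    """S = b^t x for one day."""
    return sum(bi * xi for bi, xi in zip(b, x))


def growth_rate(b, outcomes=OUTCOMES):
    """W(b, F) = E log2(b^t X) for a discrete distribution F."""
    return sum(p * math.log2(wealth_relative(b, x)) for x, p in outcomes)


def log_optimal_two_stocks(outcomes=OUTCOMES, steps=1000):
    """Grid search over b = (1 - t, t); adequate for two stocks only."""
    grid = (k / steps for k in range(steps + 1))
    t = max(grid, key=lambda t: growth_rate((1.0 - t, t), outcomes))
    return (1.0 - t, t)


def simulated_rate(b, n, seed=0, outcomes=OUTCOMES):
    """(1/n) log2 S_n for the constant rebalanced portfolio b on n i.i.d. days."""
    rng = random.Random(seed)
    xs = [x for x, _ in outcomes]
    ps = [p for _, p in outcomes]
    days = rng.choices(xs, weights=ps, k=n)
    return sum(math.log2(wealth_relative(b, x)) for x in days) / n


b_star = log_optimal_two_stocks()
w_star = growth_rate(b_star)
print(b_star, w_star, simulated_rate(b_star, 100_000))
```

In this assumed market neither asset grows on its own (each has doubling rate 0), yet rebalancing to equal proportions every day gives a strictly positive doubling rate, which is the effect the expected-logarithm criterion is meant to capture.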
We now consider some of the properties of the growth rate.

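Lemma 16.1.1 W(b, F) is concave in b and linear in F. W*(F) is
convex in F.

Proof: The growth rate is

$$W(\mathbf{b}, F) = \int \log \mathbf{b}^t \mathbf{x} \, dF(\mathbf{x}). \tag{16.7}$$

Since the integral is linear in F, so is W(b, F). Since

$$\log\left(\lambda \mathbf{b}_1 + (1 - \lambda)\mathbf{b}_2\right)^t \mathbf{X} \ge \lambda \log \mathbf{b}_1^t \mathbf{X} + (1 - \lambda) \log \mathbf{b}_2^t \mathbf{X}, \tag{16.8}$$

by the concavity of the logarithm, it follows, by taking expectations, that
W(b, F) is concave in b. Finally, to prove the convexity of W*(F) as a
function of F, let $F_1$ and $F_2$ be two distributions on the stock market and
let the corresponding optimal portfolios be $\mathbf{b}^*(F_1)$ and $\mathbf{b}^*(F_2)$, respectively.
Let the log-optimal portfolio corresponding to $\lambda F_1 + (1 - \lambda)F_2$ be
$\mathbf{b}^*(\lambda F_1 + (1 - \lambda)F_2)$. Then by linearity of W(b, F) with respect to F,
we have

$$W^*(\lambda F_1 + (1 - \lambda)F_2) = W\left(\mathbf{b}^*(\lambda F_1 + (1 - \lambda)F_2), \lambda F_1 + (1 - \lambda)F_2\right) \tag{16.9}$$

$$= \lambda W\left(\mathbf{b}^*(\lambda F_1 + (1 - \lambda)F_2), F_1\right) + (1 - \lambda) W\left(\mathbf{b}^*(\lambda F_1 + (1 - \lambda)F_2), F_2\right)$$

$$\le \lambda W\left(\mathbf{b}^*(F_1), F_1\right) + (1 - \lambda) W\left(\mathbf{b}^*(F_2), F_2\right), \tag{16.10}$$

since $\mathbf{b}^*(F_1)$ maximizes $W(\mathbf{b}, F_1)$ and $\mathbf{b}^*(F_2)$ maximizes $W(\mathbf{b}, F_2)$. □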

Lemma 16.1.2 The set of log-optimal portfolios with respect to a given 
distribution is convex. 

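Proof: Suppose that $\mathbf{b}_1$ and $\mathbf{b}_2$ are log-optimal (i.e., $W(\mathbf{b}_1, F) = W(\mathbf{b}_2, F) = W^*(F)$).
By the concavity of W(b, F) in b, we have

$$W(\lambda \mathbf{b}_1 + (1 - \lambda)\mathbf{b}_2, F) \ge \lambda W(\mathbf{b}_1, F) + (1 - \lambda) W(\mathbf{b}_2, F) = W^*(F). \tag{16.11}$$

Thus, $\lambda \mathbf{b}_1 + (1 - \lambda)\mathbf{b}_2$ is also log-optimal. □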

In the next section we use these properties to characterize the log- 
optimal portfolio. 


16.2 KUHN-TUCKER CHARACTERIZATION
OF THE LOG-OPTIMAL PORTFOLIO


Let $\mathcal{B} = \{\mathbf{b} \in \mathcal{R}^m : b_i \ge 0, \sum_{i=1}^{m} b_i = 1\}$ denote the set of allowed portfolios.
The determination of b* that achieves W*(F) is a problem of
maximization of a concave function W(b, F) over a convex set $\mathcal{B}$. The
maximum may lie on the boundary. We can use the standard Kuhn-Tucker
conditions to characterize the maximum. Instead, we derive these conditions
from first principles.


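Theorem 16.2.1 The log-optimal portfolio b* for a stock market $\mathbf{X} \sim F$
(i.e., the portfolio that maximizes the growth rate W(b, F)) satisfies the
following necessary and sufficient conditions:

$$E\left(\frac{X_i}{\mathbf{b}^{*t}\mathbf{X}}\right) \begin{cases} = 1 & \text{if } b_i^* > 0, \\ \le 1 & \text{if } b_i^* = 0. \end{cases} \tag{16.12}$$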


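Proof: The growth rate $W(\mathbf{b}) = E(\ln \mathbf{b}^t \mathbf{X})$ is concave in b, where b
ranges over the simplex of portfolios. It follows that b* is log-optimum
iff the directional derivative of W(·) in the direction from b* to any
alternative portfolio b is nonpositive. Thus, letting $\mathbf{b}_\lambda = (1 - \lambda)\mathbf{b}^* + \lambda \mathbf{b}$
for $0 \le \lambda \le 1$, we have

$$\left. \frac{d}{d\lambda} W(\mathbf{b}_\lambda) \right|_{\lambda = 0+} \le 0, \quad \mathbf{b} \in \mathcal{B}. \tag{16.13}$$

These conditions reduce to (16.12) since the one-sided derivative at
$\lambda = 0+$ of $W(\mathbf{b}_\lambda)$ is

$$\left. \frac{d}{d\lambda} E\left(\ln(\mathbf{b}_\lambda^t \mathbf{X})\right) \right|_{\lambda = 0+} = \lim_{\lambda \downarrow 0} \frac{1}{\lambda} E\left(\ln \frac{(1 - \lambda)\mathbf{b}^{*t}\mathbf{X} + \lambda \mathbf{b}^t \mathbf{X}}{\mathbf{b}^{*t}\mathbf{X}}\right) \tag{16.14}$$

$$= E\left(\lim_{\lambda \downarrow 0} \frac{1}{\lambda} \ln\left(1 + \lambda\left(\frac{\mathbf{b}^t \mathbf{X}}{\mathbf{b}^{*t}\mathbf{X}} - 1\right)\right)\right) \tag{16.15}$$

$$= E\left(\frac{\mathbf{b}^t \mathbf{X}}{\mathbf{b}^{*t}\mathbf{X}}\right) - 1, \tag{16.16}$$

where the interchange of limit and expectation can be justified using the
dominated convergence theorem [39]. Thus, (16.13) reduces to

$$E\left(\frac{\mathbf{b}^t \mathbf{X}}{\mathbf{b}^{*t}\mathbf{X}}\right) - 1 \le 0 \tag{16.17}$$

for all $\mathbf{b} \in \mathcal{B}$. If the line segment from b to b* can be extended beyond b*
in the simplex, the two-sided derivative at $\lambda = 0$ of $W(\mathbf{b}_\lambda)$ vanishes and
(16.17) holds with equality. If the line segment from b to b* cannot be
extended because of the inequality constraint on b, we have an inequality
in (16.17).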

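The Kuhn-Tucker conditions will hold for all portfolios $\mathbf{b} \in \mathcal{B}$ if they
hold for all extreme points of the simplex $\mathcal{B}$ since $E(\mathbf{b}^t \mathbf{X} / \mathbf{b}^{*t}\mathbf{X})$ is linear
in b. Furthermore, the line segment from the jth extreme point
($\mathbf{b} : b_j = 1, b_i = 0, i \ne j$) to b* can be extended beyond b* in the simplex iff
$b_j^* > 0$. Thus, the Kuhn-Tucker conditions that characterize the log-optimum
b* are equivalent to the following necessary and sufficient conditions:

$$E\left(\frac{X_i}{\mathbf{b}^{*t}\mathbf{X}}\right) \begin{cases} = 1 & \text{if } b_i^* > 0, \\ \le 1 & \text{if } b_i^* = 0. \end{cases} \quad \square \tag{16.18}$$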
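The Kuhn-Tucker conditions can be checked numerically on a small example. The following sketch evaluates $E(X_i / \mathbf{b}^t \mathbf{X})$ for each stock in a hypothetical three-stock market (cash, a stock that doubles or halves, and a stock that always loses 10 percent) and tests a candidate portfolio against the conditions. The market, the candidate portfolios, and the helper names `kt_ratios` and `satisfies_kt` are assumptions introduced for illustration only.

```python
# Hypothetical three-stock market (an assumption for illustration):
# cash, a stock that doubles or halves, and a stock that always loses 10%.
OUTCOMES = [((1.0, 2.0, 0.9), 0.5), ((1.0, 0.5, 0.9), 0.5)]


def kt_ratios(b, outcomes=OUTCOMES):
    """Return [E(X_i / b^t X) for each stock i]."""
    m = len(b)
    ratios = [0.0] * m
    for x, p in outcomes:
        s = sum(bi * xi for bi, xi in zip(b, x))
        for i in range(m):
            ratios[i] += p * x[i] / s
    return ratios


def satisfies_kt(b, outcomes=OUTCOMES, tol=1e-9):
    """Check the conditions: ratio = 1 where b_i > 0, ratio <= 1 where b_i = 0."""
    for bi, r in zip(b, kt_ratios(b, outcomes)):
        if bi > 0 and abs(r - 1.0) > tol:
            return False
        if bi == 0 and r > 1.0 + tol:
            return False
    return True


print(kt_ratios((0.5, 0.5, 0.0)))     # candidate log-optimal portfolio
print(satisfies_kt((0.5, 0.5, 0.0)))  # conditions hold
print(satisfies_kt((1.0, 0.0, 0.0)))  # all cash: the ratio for stock 2 exceeds 1
```

Because the conditions are necessary and sufficient, a portfolio that passes this check is log-optimal for the assumed market without any further optimization; the losing stock receives zero weight and its ratio falls strictly below 1.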


This theorem has a few immediate consequences. One useful equiva¬ 
lence is expressed in the following theorem. 

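Theorem 16.2.2 Let $S^* = \mathbf{b}^{*t}\mathbf{X}$ be the random wealth resulting from
the log-optimal portfolio b*. Let $S = \mathbf{b}^t \mathbf{X}$ be the wealth resulting from any
other portfolio b. Then

$$E \ln \frac{S}{S^*} \le 0 \ \text{for all } S \quad \Leftrightarrow \quad E \frac{S}{S^*} \le 1 \ \text{for all } S. \tag{16.19}$$

Proof: From Theorem 16.2.1 it follows that for a log-optimal portfolio
b*,

$$E\left(\frac{X_i}{\mathbf{b}^{*t}\mathbf{X}}\right) \le 1 \tag{16.20}$$

for all i. Multiplying this equation by $b_i$ and summing over i, we have

$$\sum_{i=1}^{m} b_i E\left(\frac{X_i}{\mathbf{b}^{*t}\mathbf{X}}\right) \le \sum_{i=1}^{m} b_i = 1, \tag{16.21}$$

which is equivalent to

$$E \frac{\mathbf{b}^t \mathbf{X}}{\mathbf{b}^{*t}\mathbf{X}} = E \frac{S}{S^*} \le 1. \tag{16.22}$$

The converse follows from Jensen’s inequality, since

$$E \log \frac{S}{S^*} \le \log E \frac{S}{S^*} \le \log 1 = 0. \quad \square \tag{16.23}$$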
Maximizing the expected logarithm was motivated by the asymptotic
growth rate. But we have just shown that the log-optimal portfolio, in 
addition to maximizing the asymptotic growth rate, also “maximizes” the 
expected wealth relative E(S/S*) for one day. We shall say more about 
the short-term optimality of the log-optimal portfolio when we consider 
the game-theoretic optimality of this portfolio. 

Another consequence of the Kuhn-Tucker characterization of the log- 
optimal portfolio is the fact that the expected proportion of wealth in 
each stock under the log-optimal portfolio is unchanged from day to day. 
Consider the stocks at the end of the first day. The initial allocation of 
wealth is b*. The proportion of the wealth in stock i at the end of the day 
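is $b_i^* X_i / \mathbf{b}^{*t}\mathbf{X}$, and the expected value of this proportion is

$$E \frac{b_i^* X_i}{\mathbf{b}^{*t}\mathbf{X}} = b_i^* E \frac{X_i}{\mathbf{b}^{*t}\mathbf{X}} = b_i^*. \tag{16.24}$$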


Hence, the proportion of wealth in stock i expected at the end of the 
day is the same as the proportion invested in stock i at the beginning of 
the day. This is a counterpart to Kelly proportional gambling, where one 
invests in proportions that remain unchanged in expected value after the 
investment period. 


16.3 ASYMPTOTIC OPTIMALITY OF THE LOG-OPTIMAL 
PORTFOLIO 

In Section 16.2 we introduced the log-optimal portfolio and explained its
motivation in terms of the long-term behavior of a sequence of investments
in repeated independent versions of the stock market. In this section we
expand on this idea and prove that with probability 1, the conditionally
log-optimal investor will not do any worse than any other investor who
uses a causal investment strategy.

We first consider an i.i.d. stock market (i.e., $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are i.i.d.
according to F(x)). Let

$$S_n = \prod_{i=1}^{n} \mathbf{b}_i^t \mathbf{X}_i \tag{16.25}$$

be the wealth after n days for an investor who uses portfolio $\mathbf{b}_i$ on day i.
Let

$$W^* = \max_{\mathbf{b}} W(\mathbf{b}, F) = \max_{\mathbf{b}} E \log \mathbf{b}^t \mathbf{X} \tag{16.26}$$

be the maximal growth rate, and let b* be a portfolio that achieves the
maximum growth rate. We only allow alternative portfolios $\mathbf{b}_i$ that depend
causally on the past and are independent of the future values of the stock
market.

Definition A nonanticipating or causal portfolio strategy is a sequence
of mappings $\mathbf{b}_i : \mathcal{R}^{m(i-1)} \to \mathcal{B}$, with the interpretation that portfolio
$\mathbf{b}_i(\mathbf{x}_1, \ldots, \mathbf{x}_{i-1})$ is used on day i.

From the definition of W*, it follows immediately that the log-optimal 
portfolio maximizes the expected log of the final wealth. This is stated in 
the following lemma. 

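Lemma 16.3.1 Let $S_n^*$ be the wealth after n days using the log-optimal
strategy b* on i.i.d. stocks, and let $S_n$ be the wealth using a causal portfolio
strategy $\mathbf{b}_i$. Then

$$E \log S_n^* = nW^* \ge E \log S_n. \tag{16.27}$$

Proof:

$$\max_{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n} E \log S_n = \max_{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n} E \sum_{i=1}^{n} \log \mathbf{b}_i^t \mathbf{X}_i \tag{16.28}$$

$$= \sum_{i=1}^{n} \max_{\mathbf{b}_i(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1})} E \log \mathbf{b}_i^t(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}) \mathbf{X}_i \tag{16.29}$$

$$= \sum_{i=1}^{n} E \log \mathbf{b}^{*t} \mathbf{X}_i \tag{16.30}$$

$$= nW^*, \tag{16.31}$$

and the maximum is achieved by a constant portfolio strategy b*. □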

So far, we have proved two simple consequences of the definition of
log-optimal portfolios: that b* (satisfying (16.12)) maximizes the expected
log wealth, and that the resulting wealth $S_n^*$ is equal to $2^{nW^*}$ to first order
in the exponent, with high probability.

Now we prove a much stronger result, which shows that $S_n^*$ exceeds
the wealth (to first order in the exponent) of any other investor for almost
every sequence of outcomes from the stock market.


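Theorem 16.3.1 (Asymptotic optimality of the log-optimal portfolio)
Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a sequence of i.i.d. stock vectors drawn according
to F(x). Let $S_n^* = \prod_{i=1}^{n} \mathbf{b}^{*t}\mathbf{X}_i$, where b* is the log-optimal portfolio, and
let $S_n = \prod_{i=1}^{n} \mathbf{b}_i^t \mathbf{X}_i$ be the wealth resulting from any other causal portfolio.
Then

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \le 0 \quad \text{with probability 1.} \tag{16.32}$$

Proof: From the Kuhn-Tucker conditions and the log optimality of $S_n^*$,
we have

$$E \frac{S_n}{S_n^*} \le 1. \tag{16.33}$$

Hence by Markov’s inequality, we have

$$\Pr\left(S_n > t_n S_n^*\right) = \Pr\left(\frac{S_n}{S_n^*} > t_n\right) \le \frac{1}{t_n}. \tag{16.34}$$

Hence,

$$\Pr\left(\frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{1}{n} \log t_n\right) \le \frac{1}{t_n}. \tag{16.35}$$

Setting $t_n = n^2$ and summing over n, we have

$$\sum_{n=1}^{\infty} \Pr\left(\frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n}\right) \le \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}. \tag{16.36}$$

Then, by the Borel-Cantelli lemma,

$$\Pr\left(\frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n}, \ \text{infinitely often}\right) = 0. \tag{16.37}$$

This implies that for almost every sequence from the stock market, there
exists an N such that for all n > N, $\frac{1}{n} \log \frac{S_n}{S_n^*} < \frac{2 \log n}{n}$. Thus,

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \le 0 \quad \text{with probability 1.} \quad \square \tag{16.38}$$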


The theorem proves that the log-optimal portfolio will perform as well 
as or better than any other portfolio to first order in the exponent. 


16.4 SIDE INFORMATION AND THE GROWTH RATE 

We showed in Chapter 6 that side information Y for the horse race X can
be used to increase the growth rate by the mutual information I(X; Y).
We now extend this result to the stock market. Here, I(X; Y) is an upper
bound on the increase in the growth rate, with equality if X is a horse
race. We first consider the decrease in growth rate incurred by believing
in the wrong distribution.


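Theorem 16.4.1 Let $\mathbf{X} \sim f(\mathbf{x})$. Let $\mathbf{b}_f$ be a log-optimal portfolio corresponding
to f(x), and let $\mathbf{b}_g$ be a log-optimal portfolio corresponding
to some other density g(x). Then the increase in growth rate $\Delta W$ by using
$\mathbf{b}_f$ instead of $\mathbf{b}_g$ is bounded by

$$\Delta W = W(\mathbf{b}_f, F) - W(\mathbf{b}_g, F) \le D(f \| g). \tag{16.39}$$

Proof: We have

$$\Delta W = \int f(\mathbf{x}) \log \mathbf{b}_f^t \mathbf{x} - \int f(\mathbf{x}) \log \mathbf{b}_g^t \mathbf{x} \tag{16.40}$$

$$= \int f(\mathbf{x}) \log \frac{\mathbf{b}_f^t \mathbf{x}}{\mathbf{b}_g^t \mathbf{x}} \tag{16.41}$$

$$= \int f(\mathbf{x}) \log \left(\frac{\mathbf{b}_f^t \mathbf{x}}{\mathbf{b}_g^t \mathbf{x}} \frac{g(\mathbf{x})}{f(\mathbf{x})} \frac{f(\mathbf{x})}{g(\mathbf{x})}\right) \tag{16.42}$$

$$= \int f(\mathbf{x}) \log \left(\frac{\mathbf{b}_f^t \mathbf{x}}{\mathbf{b}_g^t \mathbf{x}} \frac{g(\mathbf{x})}{f(\mathbf{x})}\right) + D(f \| g) \tag{16.43}$$

$$\overset{(a)}{\le} \log \int f(\mathbf{x}) \frac{\mathbf{b}_f^t \mathbf{x}}{\mathbf{b}_g^t \mathbf{x}} \frac{g(\mathbf{x})}{f(\mathbf{x})} + D(f \| g) \tag{16.44}$$

$$= \log \int g(\mathbf{x}) \frac{\mathbf{b}_f^t \mathbf{x}}{\mathbf{b}_g^t \mathbf{x}} + D(f \| g) \tag{16.45}$$

$$\overset{(b)}{\le} \log 1 + D(f \| g) \tag{16.46}$$

$$= D(f \| g), \tag{16.47}$$

where (a) follows from Jensen’s inequality and (b) follows from the
Kuhn-Tucker conditions and the fact that $\mathbf{b}_g$ is log-optimal for g. □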

Theorem 16.4.2 The increase $\Delta W$ in growth rate due to side information
Y is bounded by

$$\Delta W \le I(\mathbf{X}; Y). \tag{16.48}$$

Proof: Let $(\mathbf{X}, Y) \sim f(\mathbf{x}, y)$, where X is the market vector and Y is the
related side information. Given side information Y = y, the log-optimal
investor uses the conditional log-optimal portfolio for the conditional
distribution $f(\mathbf{x} | Y = y)$. Hence, conditional on Y = y, we have, from
Theorem 16.4.1,

$$\Delta W_{Y=y} \le D\left(f(\mathbf{x} | Y = y) \| f(\mathbf{x})\right) = \int_{\mathbf{x}} f(\mathbf{x} | Y = y) \log \frac{f(\mathbf{x} | Y = y)}{f(\mathbf{x})} \, d\mathbf{x}. \tag{16.49}$$

Averaging this over possible values of Y, we have

$$\Delta W \le \int_y f(y) \int_{\mathbf{x}} f(\mathbf{x} | Y = y) \log \frac{f(\mathbf{x} | Y = y)}{f(\mathbf{x})} \, d\mathbf{x} \, dy \tag{16.50}$$

$$= \int_y \int_{\mathbf{x}} f(y) f(\mathbf{x} | Y = y) \log \frac{f(\mathbf{x} | Y = y)}{f(\mathbf{x})} \frac{f(y)}{f(y)} \, d\mathbf{x} \, dy \tag{16.51}$$

$$= \int_y \int_{\mathbf{x}} f(\mathbf{x}, y) \log \frac{f(\mathbf{x}, y)}{f(\mathbf{x}) f(y)} \, d\mathbf{x} \, dy \tag{16.52}$$

$$= I(\mathbf{X}; Y). \tag{16.53}$$

Hence, the increase in growth rate is bounded above by the mutual information
between the side information Y and the stock market X. □
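The bound of Theorem 16.4.2 can be illustrated on a small discrete example. In the sketch below, the market is cash plus a stock that doubles or halves with equal probability, and the side information Y reports the direction of the stock correctly with probability 0.9. These distributions, the grid search over two-stock portfolios, and all names are assumptions chosen for illustration; the code simply compares the gain in doubling rate with I(X; Y) computed from the same joint distribution.

```python
import math

# Hypothetical joint distribution (assumption): X is (cash, stock) where the
# stock doubles ("up") or halves ("down") equally often; Y reports the
# direction correctly with probability 0.9. Keys of JOINT are (x, y).
X_VALUES = {"up": (1.0, 2.0), "down": (1.0, 0.5)}
JOINT = {("up", "up"): 0.45, ("up", "down"): 0.05,
         ("down", "up"): 0.05, ("down", "down"): 0.45}


def best_growth(px, steps=1000):
    """max over b = (1 - t, t) of E log2(b^t X) when X has distribution px."""
    def w(t):
        return sum(p * math.log2((1.0 - t) * X_VALUES[x][0] + t * X_VALUES[x][1])
                   for x, p in px.items() if p > 0)
    return max(w(k / steps) for k in range(steps + 1))


def marginal(index):
    """Marginal of JOINT over coordinate 0 (x) or 1 (y)."""
    out = {}
    for key, p in JOINT.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


p_x, p_y = marginal(0), marginal(1)
w_without = best_growth(p_x)
# With side information, use the conditionally log-optimal portfolio for each y.
w_with = sum(
    py * best_growth({x: JOINT[(x, y)] / py for x in p_x})
    for y, py in p_y.items()
)
mutual_information = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in JOINT.items() if p > 0
)
delta_w = w_with - w_without
print(delta_w, mutual_information)
```

In this assumed example the inequality is strict: the market is not a horse race, so part of the information in Y cannot be converted into growth.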


16.5 INVESTMENT IN STATIONARY MARKETS 

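We now extend some of the results of Section 16.4 from i.i.d. markets
to time-dependent market processes. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n, \ldots$ be a
vector-valued stochastic process with $\mathbf{X}_i \ge 0$. We consider investment strategies
that depend on the past values of the market in a causal fashion (i.e., $\mathbf{b}_i$
may depend on $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}$). Let

$$S_n = \prod_{i=1}^{n} \mathbf{b}_i^t(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}) \mathbf{X}_i. \tag{16.54}$$

Our objective is to maximize $E \log S_n$ over all such causal portfolio
strategies $\{\mathbf{b}_i(\cdot)\}$. Now

$$\max_{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n} E \log S_n = \sum_{i=1}^{n} \max_{\mathbf{b}_i^t(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1})} E \log \mathbf{b}_i^t \mathbf{X}_i \tag{16.55}$$

$$= \sum_{i=1}^{n} E \log \mathbf{b}_i^{*t} \mathbf{X}_i, \tag{16.56}$$

where $\mathbf{b}_i^*$ is the log-optimal portfolio for the conditional distribution of $\mathbf{X}_i$
given the past values of the stock market; that is, $\mathbf{b}_i^*(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{i-1})$ is
the portfolio that achieves the conditional maximum, which is denoted by

$$\max_{\mathbf{b}} E\left[\log \mathbf{b}^t \mathbf{X}_i \mid (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}) = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{i-1})\right] = W^*(\mathbf{X}_i \mid \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{i-1}). \tag{16.57}$$

Taking the expectation over the past, we write

$$W^*(\mathbf{X}_i \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}) = E \max_{\mathbf{b}} E\left[\log \mathbf{b}^t \mathbf{X}_i \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}\right] \tag{16.58}$$

as the conditional optimal growth rate, where the maximum is over all
portfolio-valued functions b defined on $\mathbf{X}_1, \ldots, \mathbf{X}_{i-1}$. Thus, the highest
expected log return is achieved by using the conditional log-optimal
portfolio at each stage. Let

$$W^*(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n) = \max_{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n} E \log S_n, \tag{16.59}$$

where the maximum is over all causal portfolio strategies. Then since
$\log S_n^* = \sum_{i=1}^{n} \log \mathbf{b}_i^{*t} \mathbf{X}_i$, we have the following chain rule for W*:

$$W^*(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n) = \sum_{i=1}^{n} W^*(\mathbf{X}_i \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}). \tag{16.60}$$

This chain rule is formally the same as the chain rule for H. In some
ways, W is the dual of H. In particular, conditioning reduces H but
increases W. We now define the counterpart of the entropy rate for
time-dependent stochastic processes.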


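Definition The growth rate $W_\infty^*$ is defined as

$$W_\infty^* = \lim_{n \to \infty} \frac{W^*(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)}{n} \tag{16.61}$$

if the limit exists.

Theorem 16.5.1 For a stationary market, the growth rate exists and is
equal to

$$W_\infty^* = \lim_{n \to \infty} W^*(\mathbf{X}_n \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{n-1}). \tag{16.62}$$

Proof: By stationarity, $W^*(\mathbf{X}_n \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{n-1})$ is nondecreasing in n.
Hence, it must have a limit, possibly infinity. Since

$$\frac{W^*(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)}{n} = \frac{1}{n} \sum_{i=1}^{n} W^*(\mathbf{X}_i \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}), \tag{16.63}$$

it follows by the theorem of the Cesáro mean (Theorem 4.2.3) that the
left-hand side has the same limit as the limit of the terms on the right-hand
side. Hence, $W_\infty^*$ exists and

$$W_\infty^* = \lim_{n \to \infty} \frac{W^*(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)}{n} = \lim_{n \to \infty} W^*(\mathbf{X}_n \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{n-1}). \quad \square \tag{16.64}$$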

We can now extend the asymptotic optimality property to stationary 
markets. We have the following theorem. 


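Theorem 16.5.2 Consider an arbitrary stochastic process $\{\mathbf{X}_i\}$, $\mathbf{X}_i \in \mathcal{R}_+^m$,
conditionally log-optimal portfolios $\mathbf{b}_i^*(\mathbf{X}^{i-1})$, and wealth $S_n^*$. Let $S_n$
be the wealth generated by any other causal portfolio strategy $\mathbf{b}_i(\mathbf{X}^{i-1})$.
Then $S_n / S_n^*$ is a positive supermartingale with respect to the sequence of
σ-fields generated by the past $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. Consequently, there exists
a random variable V such that

$$\frac{S_n}{S_n^*} \to V \quad \text{with probability 1} \tag{16.65}$$

$$EV \le 1 \tag{16.66}$$

and

$$\Pr\left\{\sup_n \frac{S_n}{S_n^*} \ge t\right\} \le \frac{1}{t}. \tag{16.67}$$

Proof: $S_n / S_n^*$ is a positive supermartingale because

$$E\left[\left. \frac{S_{n+1}(\mathbf{X}^{n+1})}{S_{n+1}^*(\mathbf{X}^{n+1})} \right| \mathbf{X}^n\right] = E\left[\left. \frac{(\mathbf{b}_{n+1}^t \mathbf{X}_{n+1}) S_n(\mathbf{X}^n)}{(\mathbf{b}_{n+1}^{*t} \mathbf{X}_{n+1}) S_n^*(\mathbf{X}^n)} \right| \mathbf{X}^n\right] \tag{16.68}$$

$$= \frac{S_n(\mathbf{X}^n)}{S_n^*(\mathbf{X}^n)} E\left[\left. \frac{\mathbf{b}_{n+1}^t \mathbf{X}_{n+1}}{\mathbf{b}_{n+1}^{*t} \mathbf{X}_{n+1}} \right| \mathbf{X}^n\right] \tag{16.69}$$

$$\le \frac{S_n(\mathbf{X}^n)}{S_n^*(\mathbf{X}^n)}, \tag{16.70}$$

by the Kuhn-Tucker condition on the conditionally log-optimal portfolio.
Thus, by the martingale convergence theorem, $S_n / S_n^*$ has a limit, call it
V, and $EV \le E(S_0 / S_0^*) = 1$. Finally, the result for $\sup_n (S_n / S_n^*)$ follows
from Kolmogorov’s inequality for positive martingales. □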

We remark that (16.67) shows how strong the competitive optimality
of $S_n^*$ is. Apparently, the probability is at most 1/10 that $S_n(\mathbf{X}^n)$ will
ever be 10 times as large as $S_n^*(\mathbf{X}^n)$. For a stationary ergodic market, we
can extend the asymptotic equipartition property to prove the following
theorem.

Theorem 16.5.3 (AEP for the stock market) Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a
stationary ergodic vector-valued stochastic process. Let $S_n^*$ be the wealth
at time n for the conditionally log-optimal strategy, where

$$S_n^* = \prod_{i=1}^{n} \mathbf{b}_i^{*t}(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{i-1}) \mathbf{X}_i. \tag{16.71}$$

Then

$$\frac{1}{n} \log S_n^* \to W_\infty^* \quad \text{with probability 1.} \tag{16.72}$$

Proof: The proof involves a generalization of the sandwich argument
[20] used to prove the AEP in Section 16.8. The details of the proof (in
Algoet and Cover [21]) are omitted. □

Finally, we consider the example of the horse race once again. The
horse race is a special case of the stock market in which there are m
stocks corresponding to the m horses in the race. At the end of the race,
the value of the stock for horse i is either 0 or $o_i$, the value of the odds
for horse i. Thus, X is nonzero only in the component corresponding to
the winning horse.

In this case, the log-optimal portfolio is proportional betting, known as
Kelly gambling (i.e., $b_i^* = p_i$), and in the case of uniform fair odds (i.e.,
$o_i = m$ for all i),

$$W^* = \log m - H(X). \tag{16.73}$$

When we have a sequence of correlated horse races, the optimal portfolio
is conditional proportional betting and the asymptotic growth rate is

$$W_\infty^* = \log m - H(\mathcal{X}), \tag{16.74}$$

where $H(\mathcal{X}) = \lim \frac{1}{n} H(X_1, X_2, \ldots, X_n)$ if the limit exists. Then
Theorem 16.5.3 asserts that

$$S_n^* \doteq 2^{nW_\infty^*}, \tag{16.75}$$

in agreement with the results in Chapter 6.
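The single-race formula $W^* = \log m - H(X)$ is easy to check numerically. The sketch below evaluates the doubling rate of proportional betting for a three-horse race with uniform fair odds and compares it with $\log m - H$. The win probabilities and the function names are hypothetical choices made only for this example.

```python
import math


def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def horse_race_growth(b, p, odds):
    """E log2(b^t X) when X = odds[i] * e_i with probability p[i]."""
    return sum(pi * math.log2(bi * oi) for bi, pi, oi in zip(b, p, odds) if pi > 0)


p = (0.5, 0.25, 0.25)      # hypothetical win probabilities
m = len(p)
odds = (float(m),) * m     # uniform fair odds, o_i = m

w_kelly = horse_race_growth(p, p, odds)                  # proportional betting b = p
w_uniform = horse_race_growth((1 / 3, 1 / 3, 1 / 3), p, odds)  # spread bets evenly
print(w_kelly, math.log2(m) - entropy(p), w_uniform)
```

Spreading the bets evenly returns exactly the initial wealth under uniform fair odds, so its doubling rate is zero, while proportional betting earns the gap between log m and the entropy of the race.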


16.6 COMPETITIVE OPTIMALITY OF THE LOG-OPTIMAL 
PORTFOLIO 


We now ask whether the log-optimal portfolio outperforms alternative 
portfolios at a given finite time n. As a direct consequence of the 
Kuhn-Tucker conditions, we have 

$$E \frac{S_n}{S_n^*} \le 1 \tag{16.76}$$

and hence by Markov’s inequality,

$$\Pr\left(S_n > t S_n^*\right) \le \frac{1}{t}. \tag{16.77}$$

This result is similar to the result derived in Chapter 5 for the competitive 
optimality of Shannon codes. 

By considering examples, it can be seen that it is not possible to get
a better bound on the probability that $S_n > S_n^*$. Consider a stock market
with two stocks and two possible outcomes,

$$(X_1, X_2) = \begin{cases} \left(1, \frac{1}{1 - \epsilon}\right) & \text{with probability } 1 - \epsilon, \\ (1, 0) & \text{with probability } \epsilon. \end{cases} \tag{16.78}$$

In this market the log-optimal portfolio invests all the wealth in the first
stock. [It is easy to verify that b = (1, 0) satisfies the Kuhn-Tucker
conditions.] However, an investor who puts all his wealth in the second stock
earns more money with probability $1 - \epsilon$. Hence, it is not true that with
high probability the log-optimal investor will do better than any other
investor.
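The example can be verified directly. The sketch below assumes the two outcomes are (1, 1/(1 − ε)) with probability 1 − ε and (1, 0) with probability ε, picks the hypothetical value ε = 0.01, and computes the Kuhn-Tucker ratios for b = (1, 0), the probability that the all-in-stock-2 investor finishes ahead, and the expected wealth ratio E(S/S*). Variable names are illustrative only.

```python
EPS = 0.01  # hypothetical value of epsilon

# Assumed market: (1, 1/(1 - eps)) w.p. 1 - eps, and (1, 0) w.p. eps.
OUTCOMES = [((1.0, 1.0 / (1.0 - EPS)), 1.0 - EPS), ((1.0, 0.0), EPS)]

b_star = (1.0, 0.0)  # all wealth in the first stock
b_alt = (0.0, 1.0)   # all wealth in the second stock


def wealth(b, x):
    """One-day wealth relative b^t x."""
    return sum(bi * xi for bi, xi in zip(b, x))


# Kuhn-Tucker ratios E(X_i / b*^t X) for i = 1, 2.
kt = [sum(p * x[i] / wealth(b_star, x) for x, p in OUTCOMES) for i in range(2)]

# Probability that the alternative investor ends the day strictly ahead.
p_alt_wins = sum(p for x, p in OUTCOMES if wealth(b_alt, x) > wealth(b_star, x))

# Expected wealth ratio E(S / S*), which the Kuhn-Tucker conditions bound by 1.
expected_ratio = sum(p * wealth(b_alt, x) / wealth(b_star, x) for x, p in OUTCOMES)

print(kt, p_alt_wins, expected_ratio)
```

Both ratios come out equal to 1, so b = (1, 0) is log-optimal, even though the competitor wins on 99 percent of days; the competitor's advantage on those days is small, and on the remaining days it loses everything.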

The problem with trying to prove that the log-optimal investor does
best with a probability of at least 1/2 is that there exist examples like the
one above, where it is possible to beat the log-optimal investor by a
small amount most of the time. We can get around this by allowing each
investor an additional fair randomization, which has the effect of reducing
the effect of small differences in the wealth.
Theorem 16.6.1 (Competitive optimality) Let S* be the wealth at the
end of one period of investment in a stock market X with the log-optimal
portfolio, and let S be the wealth induced by any other portfolio. Let U* be
a random variable independent of X uniformly distributed on [0, 2], and
let V be any other random variable independent of X and U* with $V \ge 0$
and EV = 1. Then

$$\Pr(VS \ge U^* S^*) \le \frac{1}{2}. \tag{16.79}$$


Remark Here U* and V correspond to initial “fair” randomizations of
the initial wealth. This exchange of initial wealth $S_0 = 1$ for “fair” wealth
U* can be achieved in practice by placing a fair bet. The effect of the
fair randomization is to randomize small differences, so that only the
significant deviations of the ratio S/S* affect the probability of winning.


Proof: We have

$$\Pr(VS \ge U^* S^*) = \Pr\left(\frac{VS}{S^*} \ge U^*\right) \tag{16.80}$$

$$= \Pr(W \ge U^*), \tag{16.81}$$

where $W = \frac{VS}{S^*}$ is a non-negative-valued random variable with mean

$$EW = E(V) E\left(\frac{S}{S^*}\right) \le 1 \tag{16.82}$$

by the independence of V from X and the Kuhn-Tucker conditions. Let
F be the distribution function of W. Then since U* is uniform on [0, 2],

$$\Pr(W \ge U^*) = \int_0^2 \Pr(W > w) f_{U^*}(w) \, dw \tag{16.83}$$

$$= \int_0^2 \Pr(W > w) \frac{1}{2} \, dw \tag{16.84}$$

$$= \int_0^2 \frac{1 - F(w)}{2} \, dw \tag{16.85}$$

$$\le \int_0^\infty \frac{1 - F(w)}{2} \, dw \tag{16.86}$$

$$= \frac{1}{2} EW \tag{16.87}$$

$$\le \frac{1}{2}, \tag{16.88}$$

using the easily proved fact (by integrating by parts) that

$$EW = \int_0^\infty (1 - F(w)) \, dw \tag{16.89}$$

for a positive random variable W. Hence, we have

$$\Pr(VS \ge U^* S^*) = \Pr(W \ge U^*) \le \frac{1}{2}. \quad \square \tag{16.90}$$

Theorem 16.6.1 provides a short-term justification for the use of the 
log-optimal portfolio. If the investor’s only objective is to be ahead of his 
opponent at the end of the day in the stock market, and if fair randomization
is allowed, Theorem 16.6.1 says that the investor should exchange his
wealth for a uniform [0, 2] wealth and then invest using the log-optimal 
portfolio. This is the game-theoretic solution to the problem of gambling 
competitively in the stock market. 
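A small computation can make the bound of Theorem 16.6.1 concrete. The sketch below uses a hypothetical one-period market (cash plus a stock that doubles or halves with equal probability, for which b* = (1/2, 1/2) satisfies the Kuhn-Tucker conditions) and a competitor who holds only the stock. It evaluates Pr(VS ≥ U*S*) exactly, using the fact that for U* uniform on [0, 2] this probability equals E[min(W, 2)]/2 with W = VS/S*, and cross-checks the value by simulation. The two choices of V and all function names are assumptions made for illustration.

```python
import random

# Hypothetical one-period market: cash plus a stock that doubles or halves.
OUTCOMES = [((1.0, 2.0), 0.5), ((1.0, 0.5), 0.5)]
B_STAR = (0.5, 0.5)  # log-optimal for this assumed market


def wealth(b, x):
    """One-day wealth relative b^t x."""
    return sum(bi * xi for bi, xi in zip(b, x))


def exact_win_probability(b, v_dist):
    """Pr(V S >= U* S*) = E[min(W, 2)] / 2, W = V S / S*, U* ~ Uniform[0, 2]."""
    total = 0.0
    for x, px in OUTCOMES:
        for v, pv in v_dist:
            w = v * wealth(b, x) / wealth(B_STAR, x)
            total += px * pv * min(w, 2.0) / 2.0
    return total


def simulated_win_probability(b, v_dist, n=200_000, seed=0):
    """Monte Carlo estimate of Pr(V S >= U* S*)."""
    rng = random.Random(seed)
    xs, pxs = zip(*OUTCOMES)
    vs, pvs = zip(*v_dist)
    wins = 0
    for _ in range(n):
        x = rng.choices(xs, weights=pxs)[0]
        v = rng.choices(vs, weights=pvs)[0]
        u_star = rng.uniform(0.0, 2.0)
        if v * wealth(b, x) >= u_star * wealth(B_STAR, x):
            wins += 1
    return wins / n


no_randomization = [(1.0, 1.0)]               # V = 1 always
double_or_nothing = [(0.0, 0.5), (2.0, 0.5)]  # V in {0, 2}, EV = 1

print(exact_win_probability((0.0, 1.0), no_randomization))
print(exact_win_probability((0.0, 1.0), double_or_nothing))
```

In this assumed market the competitor without randomization attains the bound 1/2 exactly (because W never exceeds 2 and EW = 1), while the double-or-nothing randomization does worse, since part of its mass lands above 2 where extra wealth does not help it win.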


16.7 UNIVERSAL PORTFOLIOS 

The development of the log-optimal portfolio strategy in Section 16.1
relies on the assumption that we know the distribution of the stock vectors
and can therefore calculate the optimal portfolio b*. In practice, though,
we often do not know the distribution. In this section we describe a causal
portfolio that performs well on individual sequences. Thus, we make no
statistical assumptions about the market sequence. We assume that the
stock market can be represented by a sequence of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots \in \mathcal{R}_+^m$,
where $x_{ij}$ is the price relative for stock j on day i and $\mathbf{x}_i$ is the vector
of price relatives for all stocks on day i. We begin with a finite-horizon
problem, where we have n vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$. We later extend the results
to the infinite-horizon case.

Given this sequence of stock market outcomes, what is the best we
can do? A realistic target is the growth achieved by the best constant
rebalanced portfolio strategy in hindsight (i.e., the best constant rebalanced
portfolio on the known sequence of stock market vectors). Note
that constant rebalanced portfolios are optimal against i.i.d. stock market
sequences with known distribution, so that this set of portfolios is
reasonably natural.

Let us assume that we have a number of mutual funds, each of which 
follows a constant rebalanced portfolio strategy chosen in advance. Our 
objective is to perform as well as the best of these funds. In this section 
we show that we can do almost as well as the best constant rebalanced 
portfolio without advance knowledge of the distribution of the stock
market vectors. 

One approach is to distribute the wealth among a continuum of fund
managers, each of whom follows a different constantly rebalanced
portfolio strategy. Since one of the managers will do exponentially better
than the others, the total wealth after n days will be dominated by the
largest term. We will show that we can achieve a performance of the best
fund manager within a factor of $n^{(m-1)/2}$. This is the essence of the
argument for the infinite-horizon universal portfolio strategy.

A second approach to this problem is as a game against a malicious 
opponent or nature who is allowed to choose the sequence of stock 
market vectors. We define a causal (nonanticipating) portfolio strategy
$\hat{\mathbf{b}}_i(\mathbf{x}_{i-1}, \ldots, \mathbf{x}_1)$ that depends only on the past values of the
stock market sequence. Then nature, with knowledge of the strategy
$\hat{\mathbf{b}}_i(\mathbf{x}^{i-1})$, chooses a sequence of vectors $\mathbf{x}_i$ to make the strategy
perform as poorly as possible relative to the best constantly rebalanced
portfolio for that stock sequence. Let $\mathbf{b}^*(\mathbf{x}^n)$ be the best constantly
rebalanced portfolio for a stock market sequence $\mathbf{x}^n$. Note that
$\mathbf{b}^*(\mathbf{x}^n)$ depends only on the empirical distribution of the sequence,
not on the order in which the vectors occur. At the end of n days, a
constantly rebalanced portfolio $\mathbf{b}$ achieves wealth

$$S_n(\mathbf{b}, \mathbf{x}^n) = \prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i, \tag{16.91}$$

and the best constant portfolio $\mathbf{b}^*(\mathbf{x}^n)$ achieves a wealth

$$S_n^*(\mathbf{x}^n) = \max_{\mathbf{b}} \prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i, \tag{16.92}$$

whereas the nonanticipating portfolio strategy $\hat{\mathbf{b}}_i(\mathbf{x}^{i-1})$ achieves

$$\hat{S}_n(\mathbf{x}^n) = \prod_{i=1}^{n} \hat{\mathbf{b}}_i^t(\mathbf{x}^{i-1}) \mathbf{x}_i. \tag{16.93}$$

Our objective is to find a nonanticipating portfolio strategy
$\hat{\mathbf{b}}(\cdot) = (\hat{\mathbf{b}}_1, \hat{\mathbf{b}}_2(\mathbf{x}_1), \ldots, \hat{\mathbf{b}}_i(\mathbf{x}^{i-1}))$ that does well in the
worst case in terms of the ratio of $\hat{S}_n$ to $S_n^*$. We will find the optimal
universal strategy and show that this strategy for each stock sequence
achieves wealth $\hat{S}_n$ that is within a factor $V_n \approx n^{-(m-1)/2}$ of the wealth
$S_n^*$ achieved by the best constantly rebalanced portfolio on that sequence.
This strategy depends on n, the horizon of the game. Later we describe
some horizon-free results that have the same worst-case asymptotic
performance as that of the finite-horizon game.
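
As a concrete illustration of (16.91) and (16.92), the following sketch computes the wealth of a constant rebalanced portfolio and approximates the best one in hindsight by a grid search for m = 2. The function names, the grid resolution, and the toy market (cash against a stock that alternately doubles and halves) are assumptions made for this example and do not come from the text.

```python
import numpy as np

def crp_wealth(b, x):
    """Wealth of the constant rebalanced portfolio b on the n x m array x
    of price relatives: the product over days of b^t x_i."""
    return float(np.prod(x @ b))

def best_crp_two_stocks(x, grid=1001):
    """Grid-search approximation to the best hindsight wealth for m = 2.
    Returns (fraction in stock 1, wealth)."""
    fractions = np.linspace(0.0, 1.0, grid)
    wealths = [crp_wealth(np.array([f, 1.0 - f]), x) for f in fractions]
    k = int(np.argmax(wealths))
    return float(fractions[k]), wealths[k]

# Hypothetical market: stock 1 is cash, stock 2 alternately doubles and halves.
x = np.array([[1.0, 2.0], [1.0, 0.5]] * 10)
b_star, s_star = best_crp_two_stocks(x)
print(b_star, s_star)
```

On this toy sequence either stock held alone ends with wealth 1, while rebalancing to half cash and half stock every day multiplies wealth by 1.5 x 0.75 = 1.125 every two days.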


16.7.1 Finite-Horizon Universal Portfolios 


We begin by analyzing a stock market of n periods, where n is known 
in advance, and attempt to find a portfolio strategy that does well against 
all possible sequences of n stock market vectors. The main result can be 
stated in the following theorem. 

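Theorem 16.7.1 For a stock market sequence $\mathbf{x}^n = \mathbf{x}_1, \ldots, \mathbf{x}_n$,
$\mathbf{x}_i \in \mathbf{R}^m_+$, of length n with m assets, let $S_n^*(\mathbf{x}^n)$ be the wealth
achieved by the optimal constantly rebalanced portfolio on $\mathbf{x}^n$, and let
$\hat{S}_n(\mathbf{x}^n)$ be the wealth achieved by any causal portfolio strategy
$\hat{\mathbf{b}}_i(\cdot)$ on $\mathbf{x}^n$; then

$$\max_{\hat{\mathbf{b}}_i(\cdot)} \min_{\mathbf{x}_1, \ldots, \mathbf{x}_n} \frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} = V_n, \tag{16.94}$$

where

$$V_n = \left[ \sum_{n_1 + \cdots + n_m = n} \binom{n}{n_1, n_2, \ldots, n_m} 2^{-nH(n_1/n, \ldots, n_m/n)} \right]^{-1}. \tag{16.95}$$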


Using Stirling’s approximation, we can show that $V_n$ is on the order
of $n^{-(m-1)/2}$, and therefore the growth rate for the universal portfolio on
the worst sequence differs from the growth rate of the best constantly
rebalanced portfolio on that sequence by at most a polynomial factor.
The logarithm of the ratio of growth of wealth of the universal portfolio
$\hat{\mathbf{b}}$ to the growth of wealth of the best constant portfolio behaves like
the redundancy of a universal source code. (See Shtarkov [496], where
$\log V_n$ appears as the minimax individual sequence redundancy in data
compression.)
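
To get a feel for the size of $V_n$, the sum in (16.95) can be evaluated directly for m = 2. The sketch below is only a numerical illustration; the function names are chosen for this example, and the comparison value $\sqrt{2/(\pi n)}$ is the two-asset asymptotic quoted later in this section.

```python
import math

def entropy_bits(p):
    """Entropy in bits of a probability vector p, with 0 log 0 = 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def v_n_two_assets(n):
    """V_n from (16.95) with m = 2: the multinomial sum reduces to a sum
    over k = n_1, with n_2 = n - k."""
    total = sum(
        math.comb(n, k) * 2.0 ** (-n * entropy_bits((k / n, 1 - k / n)))
        for k in range(n + 1)
    )
    return 1.0 / total

for n in (1, 2, 10, 100):
    print(n, v_n_two_assets(n), math.sqrt(2.0 / (math.pi * n)))
```

For n = 1 the sum has two terms, each equal to 1, so the function returns 1/2; for n = 2 the sum is 1 + 1/2 + 1 and the function returns 0.4.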

We first illustrate the main results by means of an example for n = 1.
Consider the case of two stocks and a single day. Let the stock vector for
the day be $\mathbf{x} = (x_1, x_2)$. If $x_1 > x_2$, the best portfolio is one that puts
all its money on stock 1, and if $x_2 > x_1$, the best portfolio puts all its
money on stock 2. (If $x_1 = x_2$, all portfolios are equivalent.)

Now assume that we must choose a portfolio in advance and our opponent
can choose the stock market sequence after we have chosen our
portfolio to make us do as badly as possible relative to the best portfolio.
Given our portfolio, the opponent can ensure that we do as badly as
possible by making the stock on which we have put more weight equal to 0
and the other stock equal to 1. Our best strategy is therefore to put equal
weight on both stocks, and with this, we will achieve a growth factor at
least equal to half the growth factor of the best stock, and hence we will
achieve at least half the gain of the best constantly rebalanced portfolio.
It is not hard to calculate that $V_n = 1/2$ when n = 1 and m = 2 in equation
(16.94).

However, this result seems misleading, since it appears to suggest that 
for n days, we would use a constant uniform portfolio, putting half our 
money on each stock every day. If our opponent then chose the stock 
sequence so that only the first stock was 1 (and the other was 0) every 
day, this uniform strategy would achieve a wealth of $1/2^n$ and we would
achieve a wealth only within a factor of $2^n$ of the best constant portfolio,
which puts all the money on the first stock for all time. 

The result of the theorem shows that we can do significantly better. 
The main part of the argument is to reduce a sequence of stock vectors to 
the extreme cases where only one of the stocks is nonzero for each day. 
If we can ensure that we do well on such sequences, we can guarantee 
that we do well on any sequence of stock vectors, and achieve the bounds 
of the theorem. 

Before we prove the theorem, we need the following lemma. 

Lemma 16.7.1 For $p_1, p_2, \ldots, p_m \ge 0$ and $q_1, q_2, \ldots, q_m \ge 0$,

$$\frac{\sum_{i=1}^{m} p_i}{\sum_{i=1}^{m} q_i} \ge \min_i \frac{p_i}{q_i}. \tag{16.96}$$

Proof: Let I denote the index i that minimizes the right-hand side in
(16.96). Assume that $p_I > 0$ (if $p_I = 0$, the lemma is trivially true). Also,
if $q_I = 0$, both sides of (16.96) are infinite (all the other $q_i$’s must also
be zero), and again the inequality holds. Therefore, we can also assume
that $q_I > 0$. Then

$$\frac{\sum_{i=1}^{m} p_i}{\sum_{i=1}^{m} q_i} = \frac{p_I}{q_I} \cdot \frac{1 + \sum_{i \ne I} (p_i/p_I)}{1 + \sum_{i \ne I} (q_i/q_I)} \ge \frac{p_I}{q_I} \tag{16.97}$$

because

$$\frac{p_i}{q_i} \ge \frac{p_I}{q_I} \implies \frac{p_i}{p_I} \ge \frac{q_i}{q_I} \tag{16.98}$$

for all i. □
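
The inequality (16.96) is easy to spot-check numerically. The sketch below draws random positive vectors (the helper names, the seed, and the sampling ranges are arbitrary choices for illustration) and compares the ratio of sums with the smallest termwise ratio.

```python
import random

def ratio_of_sums(p, q):
    """Left-hand side of (16.96)."""
    return sum(p) / sum(q)

def min_ratio(p, q):
    """Right-hand side of (16.96); assumes every q_i > 0."""
    return min(pi / qi for pi, qi in zip(p, q))

random.seed(0)
trials = []
for _ in range(1000):
    m = random.randint(1, 6)
    p = [random.random() for _ in range(m)]
    q = [random.uniform(0.01, 1.0) for _ in range(m)]
    trials.append((ratio_of_sums(p, q), min_ratio(p, q)))

all_hold = all(lhs >= rhs - 1e-12 for lhs, rhs in trials)
print(all_hold)
```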

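First consider the case when n = 1. The wealth at the end of the first
day is

$$\hat{S}_1(\mathbf{x}) = \hat{\mathbf{b}}^t \mathbf{x}, \tag{16.99}$$

$$S_1(\mathbf{x}) = \mathbf{b}^t \mathbf{x} \tag{16.100}$$

and

$$\frac{\hat{S}_1(\mathbf{x})}{S_1(\mathbf{x})} = \frac{\sum_i \hat{b}_i x_i}{\sum_i b_i x_i} \ge \min_i \left\{ \frac{\hat{b}_i}{b_i} \right\}. \tag{16.101}$$

We wish to find $\max_{\hat{\mathbf{b}}} \min_{\mathbf{b}, \mathbf{x}} \hat{\mathbf{b}}^t \mathbf{x} / \mathbf{b}^t \mathbf{x}$. Nature should choose
$\mathbf{x} = \mathbf{e}_i$, where $\mathbf{e}_i$ is the ith basis vector with 1 in the component i that
minimizes $\hat{b}_i / b_i$, and the investor should choose $\hat{\mathbf{b}}$ to maximize this
minimum. This is achieved by choosing $\hat{\mathbf{b}} = (\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m})$.

The important point to realize is that

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n(\mathbf{x}^n)} = \frac{\prod_{i=1}^{n} \hat{\mathbf{b}}_i^t \mathbf{x}_i}{\prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i} \tag{16.102}$$

can also be rewritten in the form of a ratio of terms

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n(\mathbf{x}^n)} = \frac{\hat{\mathbf{b}}'^t \mathbf{x}'}{\mathbf{b}'^t \mathbf{x}'}, \tag{16.103}$$

where $\hat{\mathbf{b}}', \mathbf{b}', \mathbf{x}' \in \mathbf{R}^{m^n}_+$. Here the $m^n$ components of the constantly
rebalanced portfolios $\mathbf{b}'$ are all of the product form $b_1^{n_1} b_2^{n_2} \cdots b_m^{n_m}$. One
wishes to find a universal $\hat{\mathbf{b}}'$ that is uniformly close to the $\mathbf{b}'$’s
corresponding to constantly rebalanced portfolios.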

We can now prove the main theorem (Theorem 16.7.1). 

Proof of Theorem 16.7.1: We will prove the theorem for m = 2. The
proof extends in a straightforward fashion to the case m > 2. Denote the
stocks by 1 and 2. The key idea is to express the wealth at time n,

$$\hat{S}_n(\mathbf{x}^n) = \prod_{i=1}^{n} \hat{\mathbf{b}}_i^t \mathbf{x}_i, \tag{16.104}$$

which is a product of sums, as a sum of products. Each term in the sum
corresponds to a sequence of stock price relatives for stock 1 or stock
2 times the proportion $\hat{b}_{i1}$ or $\hat{b}_{i2}$ that the strategy places on stock 1 or
stock 2 at time i. We can therefore view the wealth $\hat{S}_n$ as a sum over
all $2^n$ possible n-sequences of 1’s and 2’s of the product of the portfolio
proportions times the stock price relatives:

$$\hat{S}_n(\mathbf{x}^n) = \sum_{j^n \in \{1,2\}^n} \prod_{i=1}^{n} \hat{b}_{i j_i} x_{i j_i} = \sum_{j^n \in \{1,2\}^n} \prod_{i=1}^{n} \hat{b}_{i j_i} \prod_{i=1}^{n} x_{i j_i}. \tag{16.105}$$

If we let $\hat{w}(j^n)$ denote the product $\prod_{i=1}^{n} \hat{b}_{i j_i}$, the total fraction of wealth
invested in the sequence $j^n$, and let

$$x(j^n) = \prod_{i=1}^{n} x_{i j_i} \tag{16.106}$$

be the corresponding return for this sequence, we can write

$$\hat{S}_n(\mathbf{x}^n) = \sum_{j^n \in \{1,2\}^n} \hat{w}(j^n) x(j^n). \tag{16.107}$$

Similar expressions apply to both the best constantly rebalanced portfolio
and the universal portfolio strategy. Thus, we have

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} = \frac{\sum_{j^n \in \{1,2\}^n} \hat{w}(j^n) x(j^n)}{\sum_{j^n \in \{1,2\}^n} w^*(j^n) x(j^n)}, \tag{16.108}$$

where $\hat{w}(j^n)$ is the amount of wealth placed on the sequence $j^n$ by the
universal nonanticipating strategy, and $w^*(j^n)$ is the amount placed by the
best constant rebalanced portfolio strategy. Now applying Lemma 16.7.1,
we have

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \min_{j^n} \frac{\hat{w}(j^n) x(j^n)}{w^*(j^n) x(j^n)} = \min_{j^n} \frac{\hat{w}(j^n)}{w^*(j^n)}. \tag{16.109}$$

Thus, the problem of maximizing the performance ratio $\hat{S}_n / S_n^*$ is reduced
to ensuring that the proportion of money bet on a sequence of stocks by
the universal portfolio is uniformly close to the proportion bet by $\mathbf{b}^*$. As
might be obvious by now, this formulation of $S_n$ reduces the n-period
stock market to a special case of a single-period stock market: there are
$2^n$ stocks, one invests $w(j^n)$ in stock $j^n$ and receives a return $x(j^n)$ for
stock $j^n$, and the total wealth $S_n$ is $\sum_{j^n} w(j^n) x(j^n)$.
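
The product-of-sums to sum-of-products step in (16.104)-(16.107) can be checked by brute force for a small horizon. In this sketch the array b holds the portfolio actually used on each day along one realized market path; the function names and the random test data are assumptions for illustration.

```python
import itertools
import numpy as np

def wealth_as_product(b, x):
    """Product over days of b_i^t x_i, as in (16.104)."""
    return float(np.prod(np.sum(b * x, axis=1)))

def wealth_as_sum(b, x):
    """Sum over all index sequences j^n of w(j^n) x(j^n), as in (16.107)."""
    n, m = x.shape
    days = range(n)
    total = 0.0
    for j in itertools.product(range(m), repeat=n):
        w = np.prod([b[i, j[i]] for i in days])    # fraction bet on sequence j^n
        ret = np.prod([x[i, j[i]] for i in days])  # return of sequence j^n
        total += w * ret
    return float(total)

rng = np.random.default_rng(0)
n, m = 6, 2
x = rng.uniform(0.5, 1.5, size=(n, m))   # hypothetical price relatives
b = rng.dirichlet(np.ones(m), size=n)    # one portfolio per day, rows sum to 1
print(wealth_as_product(b, x), wealth_as_sum(b, x))
```

The two printed numbers agree up to floating-point rounding, which is the single-period reinterpretation used in the rest of the proof.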

We first calculate the weight $w^*(j^n)$ associated with the best constant
rebalanced portfolio $\mathbf{b}^*$. We observe that a constantly rebalanced portfolio
$\mathbf{b}$ results in

$$w(j^n) = \prod_{i=1}^{n} b_{j_i} = b^k (1 - b)^{n-k}, \tag{16.110}$$

where k is the number of times 1 appears in the sequence $j^n$. Thus, $w(j^n)$
depends only on k, the number of 1’s in $j^n$. Fixing attention on $j^n$, we
find by differentiating with respect to b that the maximum value

$$w^*(j^n) = \max_{0 \le b \le 1} b^k (1 - b)^{n-k} \tag{16.111}$$

$$= \left( \frac{k}{n} \right)^k \left( \frac{n-k}{n} \right)^{n-k}, \tag{16.112}$$

which is achieved by

$$\mathbf{b}^* = \left( \frac{k}{n}, \frac{n-k}{n} \right). \tag{16.113}$$

Note that $\sum_{j^n} w^*(j^n) > 1$, reflecting the fact that the amount “bet” on
$j^n$ is chosen in hindsight, thus relieving the hindsight investor of the
responsibility of allocating his investments $w^*(j^n)$ to sum to 1. The causal
investor has no such luxury. How can the causal investor choose initial
investments $\hat{w}(j^n)$, $\sum_{j^n} \hat{w}(j^n) = 1$, to protect himself from all possible $j^n$
and hindsight-determined $w^*(j^n)$? The answer will be to choose $\hat{w}(j^n)$
proportional to $w^*(j^n)$. Then the worst-case ratio of $\hat{w}(j^n)/w^*(j^n)$ will
be maximized. To proceed, we define $V_n$ by

$$\frac{1}{V_n} = \sum_{j^n} \left( \frac{k(j^n)}{n} \right)^{k(j^n)} \left( \frac{n - k(j^n)}{n} \right)^{n - k(j^n)} \tag{16.114}$$

$$= \sum_{k=0}^{n} \binom{n}{k} \left( \frac{k}{n} \right)^k \left( \frac{n-k}{n} \right)^{n-k} \tag{16.115}$$

and let

$$\hat{w}(j^n) = V_n \left( \frac{k(j^n)}{n} \right)^{k(j^n)} \left( \frac{n - k(j^n)}{n} \right)^{n - k(j^n)}. \tag{16.116}$$

It is clear that $\hat{w}(j^n)$ is a legitimate distribution of wealth over the $2^n$
stock sequences (i.e., $\hat{w}(j^n) \ge 0$ and $\sum_{j^n} \hat{w}(j^n) = 1$). Here $V_n$ is the
normalization factor that makes $\hat{w}(j^n)$ a probability mass function. Also,
from (16.109) and (16.113), for all sequences $\mathbf{x}^n$,

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \min_{j^n} \frac{\hat{w}(j^n)}{w^*(j^n)} \tag{16.117}$$

$$= \min_k \frac{V_n \left( \frac{k}{n} \right)^k \left( \frac{n-k}{n} \right)^{n-k}}{b^{*k} (1 - b^*)^{n-k}} \tag{16.118}$$

$$\ge V_n, \tag{16.119}$$

where (16.117) follows from (16.109) and (16.119) follows from (16.112).
Consequently, we have

$$\max_{\hat{\mathbf{b}}} \min_{\mathbf{x}^n} \frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge V_n. \tag{16.120}$$
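
The achievability half of the argument can be tried out numerically for two stocks and a short horizon. The sketch below builds the weights of (16.116), evaluates the resulting wealth as a sum over all $2^n$ index sequences, and compares it with a grid-search estimate of the best hindsight wealth. The helper names, the grid resolution, and the random market are assumptions made for this example.

```python
import itertools
import math
import numpy as np

def w_star(k, n):
    """Hindsight weight (16.112) of a sequence with k ones out of n."""
    return (k / n) ** k * ((n - k) / n) ** (n - k)

def v_n(n):
    """Normalization constant V_n from (16.115) for two stocks."""
    return 1.0 / sum(math.comb(n, k) * w_star(k, n) for k in range(n + 1))

def universal_wealth(x):
    """Sum over j^n of w_hat(j^n) x(j^n), with w_hat from (16.116).
    Column 0 of x plays the role of stock 1."""
    n = len(x)
    vn = v_n(n)
    total = 0.0
    for j in itertools.product((0, 1), repeat=n):
        k = j.count(0)  # number of days on which the sequence picks stock 1
        total += vn * w_star(k, n) * math.prod(x[i][j[i]] for i in range(n))
    return total

def best_crp_wealth(x, grid=2001):
    """Grid-search approximation to the best constant rebalanced wealth."""
    x = np.asarray(x, dtype=float)
    return max(
        float(np.prod(f * x[:, 0] + (1.0 - f) * x[:, 1]))
        for f in np.linspace(0.0, 1.0, grid)
    )

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=(8, 2))  # hypothetical price relatives
print(universal_wealth(x) / best_crp_wealth(x), v_n(8))
```

Because the grid search can only underestimate the best hindsight wealth, the printed ratio is an overestimate of the true ratio; it should still come out no smaller than the printed value of $V_n$, in line with (16.120).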


We have thus demonstrated a portfolio on the 2 n possible sequences of 
length n that achieves wealth S n (x n ) within a factor V n of the wealth 
S*(x n ) achieved by the best constant rebalanced portfolio in hindsight. To 
complete the proof of the theorem, we show that this is the best possible, 
that is, that any nonanticipating portfolio $\mathbf{b}_i(\mathbf{x}^{i-1})$ cannot do better than
a factor V n in the worst case (i.e., for the worst choice of x n ). To prove 
this, we construct a set of extremal stock market sequences and show that 
the performance of any nonanticipating portfolio strategy is bounded by 
V n for at least one of these sequences, proving the worst-case bound. 

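For each $j^n \in \{1, 2\}^n$ we define the corresponding extremal stock
market vector $\mathbf{x}^n(j^n)$ as

$$\mathbf{x}_i(j_i) = \begin{cases} (1, 0)^t & \text{if } j_i = 1, \\ (0, 1)^t & \text{if } j_i = 2. \end{cases} \tag{16.121}$$

Let $\mathbf{e}_1 = (1, 0)^t$, $\mathbf{e}_2 = (0, 1)^t$ be standard basis vectors. Let

$$\mathcal{K} = \{ \mathbf{x}(j^n) : j^n \in \{1, 2\}^n, \mathbf{x}_{i j_i} = \mathbf{e}_{j_i} \} \tag{16.122}$$

be the set of extremal sequences. There are $2^n$ such extremal sequences,
and for each sequence at each time, there is only one stock that yields
a nonzero return. The wealth invested in the other stock is lost.
Therefore, the wealth at the end of n periods for extremal sequence $\mathbf{x}^n(j^n)$
is the product of the amounts invested in the stocks $j_1, j_2, \ldots, j_n$ [i.e.,
$S_n(\mathbf{x}^n(j^n)) = \prod_i b_{j_i} = w(j^n)$]. Again, we can view this as an investment
on sequences of length n, and given the 0-1 nature of the return, it is
easy to see for $\mathbf{x}^n \in \mathcal{K}$ that

$$\sum_{j^n} S_n(\mathbf{x}^n(j^n)) = 1. \tag{16.123}$$

For any extremal sequence $\mathbf{x}^n(j^n) \in \mathcal{K}$, the best constant rebalanced
portfolio is

$$\mathbf{b}^*(\mathbf{x}^n(j^n)) = \left( \frac{n_1(j^n)}{n}, \frac{n_2(j^n)}{n} \right)^t, \tag{16.124}$$

where $n_1(j^n)$ is the number of occurrences of 1 in the sequence $j^n$ and
$n_2(j^n) = n - n_1(j^n)$. The corresponding wealth at the end of n periods is

$$S_n^*(\mathbf{x}^n(j^n)) = \left( \frac{n_1(j^n)}{n} \right)^{n_1(j^n)} \left( \frac{n_2(j^n)}{n} \right)^{n_2(j^n)} = \frac{\hat{w}(j^n)}{V_n}, \tag{16.125}$$

from (16.116) and it therefore follows that

$$\sum_{\mathbf{x}^n \in \mathcal{K}} S_n^*(\mathbf{x}^n) = \frac{1}{V_n} \sum_{j^n} \hat{w}(j^n) = \frac{1}{V_n}. \tag{16.126}$$

We then have the following inequality for any portfolio sequence
$\{\mathbf{b}_i\}_{i=1}^{n}$, with $S_n(\mathbf{x}^n)$ defined as in (16.104):

$$\min_{\mathbf{x}^n \in \mathcal{K}} \frac{S_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \le \sum_{\tilde{\mathbf{x}}^n \in \mathcal{K}} \frac{S_n^*(\tilde{\mathbf{x}}^n)}{\sum_{\mathbf{x}^n \in \mathcal{K}} S_n^*(\mathbf{x}^n)} \cdot \frac{S_n(\tilde{\mathbf{x}}^n)}{S_n^*(\tilde{\mathbf{x}}^n)} \tag{16.127}$$

$$= \sum_{\tilde{\mathbf{x}}^n \in \mathcal{K}} \frac{S_n(\tilde{\mathbf{x}}^n)}{\sum_{\mathbf{x}^n \in \mathcal{K}} S_n^*(\mathbf{x}^n)} \tag{16.128}$$

$$= \frac{1}{\sum_{\mathbf{x}^n \in \mathcal{K}} S_n^*(\mathbf{x}^n)} \tag{16.129}$$

$$= V_n, \tag{16.130}$$

where the inequality follows from the fact that the minimum is less than
the average. Thus,

$$\max_{\mathbf{b}} \min_{\mathbf{x}^n \in \mathcal{K}} \frac{S_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \le V_n. \tag{16.131}$$

□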

The strategy described in the theorem puts mass on all sequences of
length n and is clearly dependent on n. We can recast the strategy in
incremental terms (i.e., in terms of the amount bet on stock 1 and stock
2 at time 1), then, conditional on the outcome at time 1, the amount bet
on each of the two stocks at time 2, and so on. Consider the weight
$\hat{b}_{i,1}$ assigned by the algorithm to stock 1 at time i given the previous
sequence of stock vectors $\mathbf{x}^{i-1}$. We can calculate this by summing over
all sequences $j^n$ that have a 1 in position i, giving

$$\hat{b}_{i,1}(\mathbf{x}^{i-1}) = \frac{\sum_{j^{i-1} \in M^{i-1}} \hat{w}(j^{i-1} 1) x(j^{i-1})}{\sum_{j^i \in M^i} \hat{w}(j^i) x(j^{i-1})}, \tag{16.132}$$

where $M^i$ denotes the set of index sequences of length i (that is,
$\{1, 2\}^i$ for two stocks),

$$\hat{w}(j^i) = \sum_{j^n : j^i \subseteq j^n} \hat{w}(j^n) \tag{16.133}$$

is the weight put on all sequences $j^n$ that start with $j^i$ and

$$x(j^{i-1}) = \prod_{k=1}^{i-1} x_{k j_k} \tag{16.134}$$

is the return on those sequences as defined in (16.106).
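
The incremental form can be exercised directly for two stocks and a small horizon. The following is a brute-force sketch under the assumption that the sequence weights are those of (16.116), so that a prefix weight depends only on its length and on how many times it picks stock 1; all function names are hypothetical.

```python
import itertools
import math

def w_star(k, n):
    """Hindsight weight (k/n)^k ((n-k)/n)^(n-k) from (16.112)."""
    return (k / n) ** k * ((n - k) / n) ** (n - k)

def v_n(n):
    """Normalization constant V_n from (16.115)."""
    return 1.0 / sum(math.comb(n, k) * w_star(k, n) for k in range(n + 1))

def prefix_weight(k, i, n):
    """Total weight of all length-n sequences extending a given length-i
    prefix that picks stock 1 exactly k times, as in (16.133)."""
    return v_n(n) * sum(
        math.comb(n - i, r) * w_star(k + r, n) for r in range(n - i + 1)
    )

def next_fraction(x_past, n):
    """Fraction placed on stock 1 on day i = len(x_past) + 1, per (16.132)."""
    i = len(x_past) + 1
    num = den = 0.0
    for j in itertools.product((0, 1), repeat=i - 1):
        k = j.count(0)  # times the prefix picks stock 1 (column 0)
        ret = math.prod(x_past[t][j[t]] for t in range(i - 1))
        num += prefix_weight(k + 1, i, n) * ret
        den += prefix_weight(k, i - 1, n) * ret
    return num / den

def run_incremental(x):
    """Wealth obtained by rebalancing each day to the causal portfolio."""
    n = len(x)
    wealth = 1.0
    for i in range(n):
        b1 = next_fraction(x[:i], n)
        wealth *= b1 * x[i][0] + (1.0 - b1) * x[i][1]
    return wealth

x = [(1.2, 0.9), (0.8, 1.1), (1.05, 1.0), (0.7, 1.3), (1.1, 0.95), (1.0, 1.2)]
print(next_fraction([], len(x)), run_incremental(x))
```

By symmetry of the weights the first-day portfolio is (1/2, 1/2), and the wealth returned by `run_incremental` matches the sum over all index sequences of weight times return, because the daily wealth ratios telescope.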

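Investigation of the asymptotics of $V_n$ reveals [401, 496] that

$$V_n \sim \left( \frac{2}{n} \right)^{(m-1)/2} \frac{\Gamma(m/2)}{\sqrt{\pi}} \tag{16.135}$$

for m assets. In particular, for m = 2 assets,

$$V_n \sim \sqrt{\frac{2}{\pi n}} \tag{16.136}$$

and

$$\frac{1}{2\sqrt{n+1}} \le V_n \le \frac{2}{\sqrt{n+1}} \tag{16.137}$$

for all n [400]. Consequently, for m = 2 stocks, the causal portfolio
strategy $\hat{\mathbf{b}}_i(\mathbf{x}^{i-1})$ given in (16.132) achieves wealth $\hat{S}_n(\mathbf{x}^n)$ such that

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge V_n \ge \frac{1}{2\sqrt{n+1}} \tag{16.138}$$

for all market sequences $\mathbf{x}^n$.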


16.7.2 Horizon-Free Universal Portfolios 

We describe the horizon-free strategy in terms of a weighting of different
portfolio strategies. As described earlier, each constantly rebalanced
portfolio $\mathbf{b}$ can be viewed as corresponding to a mutual fund that rebalances
the m assets according to $\mathbf{b}$. Initially, we distribute the wealth among
these funds according to a distribution $\mu(\mathbf{b})$, where $d\mu(\mathbf{b})$ is the amount
of wealth invested in portfolios in the neighborhood $d\mathbf{b}$ of the constantly
rebalanced portfolio $\mathbf{b}$.

Let

$$S_n(\mathbf{b}, \mathbf{x}^n) = \prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i \tag{16.139}$$

be the wealth generated by a constant rebalanced portfolio $\mathbf{b}$ on the stock
sequence $\mathbf{x}^n$. Recall that

$$S_n^*(\mathbf{x}^n) = \max_{\mathbf{b} \in \mathcal{B}} S_n(\mathbf{b}, \mathbf{x}^n) \tag{16.140}$$

is the wealth of the best constant rebalanced portfolio in hindsight.
We investigate the causal portfolio defined by

$$\hat{\mathbf{b}}_{i+1}(\mathbf{x}^i) = \frac{\int_{\mathcal{B}} \mathbf{b} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}{\int_{\mathcal{B}} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}. \tag{16.141}$$

We note that

$$\hat{\mathbf{b}}_{i+1}^t(\mathbf{x}^i) \mathbf{x}_{i+1} = \frac{\int_{\mathcal{B}} \mathbf{b}^t \mathbf{x}_{i+1} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}{\int_{\mathcal{B}} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})} \tag{16.142}$$

$$= \frac{\int_{\mathcal{B}} S_{i+1}(\mathbf{b}, \mathbf{x}^{i+1}) \, d\mu(\mathbf{b})}{\int_{\mathcal{B}} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}. \tag{16.143}$$

Thus, the product $\prod \hat{\mathbf{b}}_i^t \mathbf{x}_i$ telescopes and we see that the wealth $\hat{S}_n(\mathbf{x}^n)$
resulting from this portfolio is given by

$$\hat{S}_n(\mathbf{x}^n) = \prod_{i=1}^{n} \hat{\mathbf{b}}_i^t(\mathbf{x}^{i-1}) \mathbf{x}_i \tag{16.144}$$

$$= \int_{\mathbf{b} \in \mathcal{B}} S_n(\mathbf{b}, \mathbf{x}^n) \, d\mu(\mathbf{b}). \tag{16.145}$$

There is another way to interpret (16.145). The amount given to
portfolio manager $\mathbf{b}$ is $d\mu(\mathbf{b})$, the resulting growth factor for the manager
rebalancing to $\mathbf{b}$ is $S_n(\mathbf{b}, \mathbf{x}^n)$, and the total wealth of this batch of
investments is

$$\hat{S}_n(\mathbf{x}^n) = \int_{\mathcal{B}} S_n(\mathbf{b}, \mathbf{x}^n) \, d\mu(\mathbf{b}). \tag{16.146}$$

Then $\hat{\mathbf{b}}_{i+1}$, defined in (16.141), is the performance-weighted total “buy
order” of the individual portfolio managers $\mathbf{b}$.

So far, we have not specified what distribution $\mu(\mathbf{b})$ we use to apportion
the initial wealth. We now use a distribution $\mu$ that puts mass on all
possible portfolios, so that we approximate the performance of the best
portfolio for the actual distribution of stock price vectors.
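
The telescoping identity behind (16.144) and (16.145) also holds when the integral over portfolios is replaced by a finite weighted sum, which makes it easy to test. The sketch below uses two stocks, a uniform discrete weighting on a grid of portfolios as a stand-in for $\mu$, and random price relatives; these choices and the function name are assumptions for illustration only.

```python
import numpy as np

def mixture_portfolio_run(x, fractions, mu):
    """Run the performance-weighted portfolio of (16.141) for two stocks.

    fractions[j] is the share of stock 1 held by fund j and mu[j] is the
    initial wealth given to that fund. Returns (causal wealth, mixture wealth).
    """
    fund_wealth = np.ones_like(fractions)  # S_0(b) = 1 for every fund
    wealth = 1.0
    for x1, x2 in x:
        # Performance-weighted average of the funds' portfolios.
        b_hat = np.sum(fractions * fund_wealth * mu) / np.sum(fund_wealth * mu)
        wealth *= b_hat * x1 + (1.0 - b_hat) * x2
        fund_wealth = fund_wealth * (fractions * x1 + (1.0 - fractions) * x2)
    return wealth, float(np.sum(fund_wealth * mu))

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 1.5, size=(50, 2))              # hypothetical market
fractions = np.linspace(0.0, 1.0, 201)               # grid of funds
mu = np.full(fractions.shape, 1.0 / fractions.size)  # uniform initial split
causal_wealth, mixture_wealth = mixture_portfolio_run(x, fractions, mu)
print(causal_wealth, mixture_wealth)
```

The two returned values agree up to rounding: rebalancing each day to the performance-weighted portfolio yields the same wealth as handing the money to the funds at the start and never touching it again.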

In the next lemma, we bound $\hat{S}_n / S_n^*$ as a function of the initial wealth
distribution $\mu(\mathbf{b})$.


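Lemma 16.7.2 Let $S_n^*(\mathbf{x}^n)$ in (16.140) be the wealth achieved by the best
constant rebalanced portfolio and let $\hat{S}_n(\mathbf{x}^n)$ in (16.144) be the wealth
achieved by the universal mixed portfolio $\hat{\mathbf{b}}(\cdot)$, given by

$$\hat{\mathbf{b}}_{i+1}(\mathbf{x}^i) = \frac{\int \mathbf{b} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}{\int S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}. \tag{16.147}$$

Then

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \min_{j^n} \frac{\int_{\mathcal{B}} \prod_{i=1}^{n} b_{j_i} \, d\mu(\mathbf{b})}{\prod_{i=1}^{n} b^*_{j_i}}. \tag{16.148}$$

Proof: As before, we can write

$$S_n^*(\mathbf{x}^n) = \sum_{j^n} w^*(j^n) x(j^n), \tag{16.149}$$

where $w^*(j^n) = \prod_{i=1}^{n} b^*_{j_i}$ is the amount invested on the sequence $j^n$ and
$x(j^n) = \prod_{i=1}^{n} x_{i j_i}$ is the corresponding return. Similarly, we can write

$$\hat{S}_n(\mathbf{x}^n) = \int \prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i \, d\mu(\mathbf{b}) \tag{16.150}$$

$$= \sum_{j^n} \int \prod_{i=1}^{n} b_{j_i} x_{i j_i} \, d\mu(\mathbf{b}) \tag{16.151}$$

$$= \sum_{j^n} \hat{w}(j^n) x(j^n), \tag{16.152}$$

where $\hat{w}(j^n) = \int \prod_{i=1}^{n} b_{j_i} \, d\mu(\mathbf{b})$. Now applying Lemma 16.7.1, we have

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} = \frac{\sum_{j^n} \hat{w}(j^n) x(j^n)}{\sum_{j^n} w^*(j^n) x(j^n)} \tag{16.153}$$

$$\ge \min_{j^n} \frac{\hat{w}(j^n) x(j^n)}{w^*(j^n) x(j^n)} \tag{16.154}$$

$$= \min_{j^n} \frac{\int_{\mathcal{B}} \prod_{i=1}^{n} b_{j_i} \, d\mu(\mathbf{b})}{\prod_{i=1}^{n} b^*_{j_i}}. \tag{16.155}$$

□

We now apply this lemma when $\mu(\mathbf{b})$ is the Dirichlet($\frac{1}{2}$) distribution.

Theorem 16.7.2 For the causal universal portfolio $\hat{\mathbf{b}}_i(\cdot)$, i = 1, 2, ...,
given in (16.141), with m = 2 stocks and $d\mu(\mathbf{b})$ the Dirichlet($\frac{1}{2}, \frac{1}{2}$)
distribution, we have

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \frac{1}{2\sqrt{n+1}}$$

for all n and all stock sequences $\mathbf{x}^n$.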

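Proof: As in the discussion preceding (16.112), we can show that the
weight put by the best constant portfolio $\mathbf{b}^*$ on the sequence $j^n$ is

$$\prod_{i=1}^{n} b^*_{j_i} = \left( \frac{k}{n} \right)^k \left( \frac{n-k}{n} \right)^{n-k} = 2^{-nH(k/n)}, \tag{16.156}$$

where k is the number of indices where $j_i = 1$. We can also explicitly
calculate the integral in the numerator of (16.148) in Lemma 16.7.2 for
the Dirichlet($\frac{1}{2}$) density, defined for m variables as

$$d\mu(\mathbf{b}) = \frac{\Gamma(\frac{m}{2})}{\left[ \Gamma(\frac{1}{2}) \right]^m} \prod_{j=1}^{m} b_j^{-1/2} \, d\mathbf{b}, \tag{16.157}$$

where $\Gamma(x) = \int_0^\infty e^{-t} t^{x-1} \, dt$ denotes the gamma function. For simplicity,
we consider the case of two stocks, in which case

$$d\mu(b) = \frac{1}{\pi} \frac{1}{\sqrt{b(1-b)}} \, db, \quad 0 \le b \le 1, \tag{16.158}$$

where b is the fraction of wealth invested in stock 1. Now consider any
sequence $j^n \in \{1, 2\}^n$, and consider the amount invested in that sequence,

$$b(j^n) = \prod_{i=1}^{n} b_{j_i} = b^l (1 - b)^{n-l}, \tag{16.159}$$

where l is the number of indices where $j_i = 1$. Then

$$\int b(j^n) \, d\mu(b) = \int b^l (1 - b)^{n-l} \frac{1}{\pi} \frac{1}{\sqrt{b(1-b)}} \, db \tag{16.160}$$

$$= \frac{1}{\pi} \int b^{l - \frac{1}{2}} (1 - b)^{n - l - \frac{1}{2}} \, db \tag{16.161}$$

$$\triangleq \frac{1}{\pi} B\left( l + \frac{1}{2}, n - l + \frac{1}{2} \right), \tag{16.162}$$

where $B(\lambda_1, \lambda_2)$ is the beta function, defined as

$$B(\lambda_1, \lambda_2) = \int_0^1 x^{\lambda_1 - 1} (1 - x)^{\lambda_2 - 1} \, dx \tag{16.163}$$

$$= \frac{\Gamma(\lambda_1) \Gamma(\lambda_2)}{\Gamma(\lambda_1 + \lambda_2)} \tag{16.164}$$

and

$$\Gamma(\lambda) = \int_0^\infty x^{\lambda - 1} e^{-x} \, dx. \tag{16.165}$$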

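Note that for any integer n, $\Gamma(n + 1) = n!$ and
$\Gamma(n + \frac{1}{2}) = \frac{1 \cdot 3 \cdot 5 \cdots (2n-1)}{2^n} \sqrt{\pi}$.

We can calculate $B(l + \frac{1}{2}, n - l + \frac{1}{2})$ by means of simple recursion
using integration by parts. Alternatively, using (16.164), we obtain

$$B\left( l + \frac{1}{2}, n - l + \frac{1}{2} \right) = \frac{\pi}{2^{2n}} \frac{\binom{2n}{n} \binom{n}{l}}{\binom{2n}{2l}}. \tag{16.166}$$

Combining all the results with Lemma 16.7.2, we have

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \min_{j^n} \frac{\int_{\mathcal{B}} \prod_{i=1}^{n} b_{j_i} \, d\mu(\mathbf{b})}{\prod_{i=1}^{n} b^*_{j_i}} \tag{16.167}$$

$$\ge \min_{l} \frac{\frac{1}{\pi} B(l + \frac{1}{2}, n - l + \frac{1}{2})}{2^{-nH(l/n)}} \tag{16.168}$$

$$\ge \frac{1}{2\sqrt{n+1}}, \tag{16.169}$$

using the results in [135, Theorem 2]. □

It follows for m = 2 stocks that

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \frac{1}{\sqrt{2\pi}} V_n \tag{16.170}$$

for all n and all market sequences $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. Thus, good minimax
performance for all n costs at most an extra factor $\sqrt{2\pi}$ over the fixed
horizon minimax portfolio. The cost of universality is $V_n$, which is
asymptotically negligible in the growth rate in the sense that

$$\frac{1}{n} \ln \hat{S}_n(\mathbf{x}^n) - \frac{1}{n} \ln S_n^*(\mathbf{x}^n) \ge \frac{1}{n} \ln \frac{V_n}{\sqrt{2\pi}} \to 0. \tag{16.171}$$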

Thus, the universal causal portfolio achieves the same asymptotic growth 
rate of wealth as the best hindsight portfolio. 
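
The beta-function identity (16.166) and the bound in (16.168)-(16.169) can be checked numerically for moderate n. The sketch below relies only on the standard-library gamma function; the function names and the range of n are choices made for this illustration.

```python
import math

def beta_half(l, n):
    """B(l + 1/2, n - l + 1/2) computed from gamma functions, as in (16.164)."""
    return math.gamma(l + 0.5) * math.gamma(n - l + 0.5) / math.gamma(n + 1)

def beta_half_closed_form(l, n):
    """Right-hand side of (16.166)."""
    return (math.pi / 4 ** n) * math.comb(2 * n, n) * math.comb(n, l) / math.comb(2 * n, 2 * l)

def dirichlet_ratio(l, n):
    """(1/pi) B(l + 1/2, n - l + 1/2) divided by 2^{-nH(l/n)}, the quantity
    minimized over l in (16.168)."""
    w_star = (l / n) ** l * ((n - l) / n) ** (n - l)
    return beta_half(l, n) / math.pi / w_star

for n in (1, 5, 20, 50):
    worst = min(dirichlet_ratio(l, n) for l in range(n + 1))
    print(n, worst, 1.0 / (2.0 * math.sqrt(n + 1)))
```

For n = 1 both values of l give a ratio of 1/2, comfortably above the bound $1/(2\sqrt{2})$.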

Let’s now consider how this portfolio algorithm performs on two real
stocks. We consider a 14-year period (ending in 2004) and two stocks,
Hewlett-Packard and Altria (formerly Philip Morris), which are both
components of the Dow Jones Index. Over these 14 years, HP went up by
a factor of 11.8, while Altria went up by a factor of 11.5. The performance
of the different constantly rebalanced portfolios that contain HP and Altria
is shown in Figure 16.2. The best constantly rebalanced portfolio (which
can be computed only in hindsight) achieves a growth of a factor of 18.7
using a mixture of about 51% HP and 49% Altria. The universal portfolio
strategy described in this section achieves a growth factor of 15.7 without
foreknowledge.



FIGURE 16.2. Performance of different constant rebalanced portfolios b for HP and Altria.
(Horizontal axis: proportion b of wealth in HPQ.)


16.8 SHANNON-MCMILLAN-BREIMAN THEOREM (GENERAL AEP)


The AEP for ergodic processes has come to be known as the
Shannon-McMillan-Breiman theorem. In Chapter 3 we proved the AEP for
i.i.d. processes. In this section we offer a proof of the theorem for a
general ergodic process. We prove the convergence of $\frac{1}{n} \log p(X^n)$ by
sandwiching it between two ergodic sequences.

In a sense, an ergodic process is the most general dependent process for
which the strong law of large numbers holds. For finite alphabet processes,
ergodicity is equivalent to the convergence of the kth-order empirical
distributions to their marginals for all k.

The technical definition requires some ideas from probability theory. To
be precise, an ergodic source is defined on a probability space $(\Omega, \mathcal{B}, P)$,
where $\mathcal{B}$ is a $\sigma$-algebra of subsets of $\Omega$ and P is a probability measure.
A random variable X is defined as a function $X(\omega)$, $\omega \in \Omega$, on the
probability space. We also have a transformation $T : \Omega \to \Omega$, which plays
the role of a time shift. We will say that the transformation is stationary
if $P(TA) = P(A)$ for all $A \in \mathcal{B}$. The transformation is called ergodic if
every set A such that $TA = A$, a.e., satisfies $P(A) = 0$ or 1. If T is
stationary and ergodic, we say that the process defined by
$X_n(\omega) = X(T^n \omega)$ is stationary and ergodic. For a stationary ergodic
source, Birkhoff’s ergodic theorem states that

$$\frac{1}{n} \sum_{i=1}^{n} X_i(\omega) \to EX = \int X \, dP \quad \text{with probability 1.} \tag{16.172}$$

Thus, the law of large numbers holds for ergodic processes.

We wish to use the ergodic theorem to conclude that

$$-\frac{1}{n} \log p(X_0, X_1, \ldots, X_{n-1}) = -\frac{1}{n} \sum_{i=0}^{n-1} \log p(X_i | X_0^{i-1}) \to \lim_{n \to \infty} E[-\log p(X_n | X_0^{n-1})]. \tag{16.173}$$

But the stochastic sequence $p(X_i | X_0^{i-1})$ is not ergodic. However, the
closely related quantities $p(X_i | X_{i-k}^{i-1})$ and $p(X_i | X_{-\infty}^{i-1})$ are ergodic and
have expectations easily identified as entropy rates. We plan to sandwich
$p(X_i | X_0^{i-1})$ between these two more tractable processes.

We define the kth-order entropy $H^k$ as

$$H^k = E\{-\log p(X_k | X_{k-1}, X_{k-2}, \ldots, X_0)\} \tag{16.174}$$

$$= E\{-\log p(X_0 | X_{-1}, X_{-2}, \ldots, X_{-k})\}, \tag{16.175}$$

where the last equation follows from stationarity. Recall that the entropy
rate is given by

$$H = \lim_{k \to \infty} H^k \tag{16.176}$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} H^k. \tag{16.177}$$

Of course, $H^k \searrow H$ by stationarity and the fact that conditioning does
not increase entropy. It will be crucial that $H^k \searrow H = H^\infty$, where

$$H^\infty = E\{-\log p(X_0 | X_{-1}, X_{-2}, \ldots)\}. \tag{16.178}$$

The proof that $H^\infty = H$ involves exchanging expectation and limit.
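
The monotone decrease of $H^k$ can be observed on a process that is not Markov of any finite order. The sketch below uses a hypothetical two-state hidden Markov source (the transition matrix P, emission matrix E, and function names are all assumptions for this example), computes block entropies by enumerating all observation sequences, and forms $H^k$ as the difference of consecutive block entropies.

```python
import itertools
import math
import numpy as np

def sequence_prob(y, pi, P, E):
    """Probability of the observation sequence y under a hidden Markov
    source started in its stationary distribution (forward algorithm)."""
    alpha = pi * E[:, y[0]]
    for sym in y[1:]:
        alpha = (alpha @ P) * E[:, sym]
    return float(alpha.sum())

def block_entropy(length, pi, P, E):
    """Entropy in bits of a block of `length` consecutive observations."""
    h = 0.0
    for y in itertools.product((0, 1), repeat=length):
        p = sequence_prob(y, pi, P, E)
        if p > 0:
            h -= p * math.log2(p)
    return h

def kth_order_entropies(k_max, pi, P, E):
    """H^k = H(X_0, ..., X_k) - H(X_0, ..., X_{k-1}) for k = 0, ..., k_max."""
    blocks = [0.0] + [block_entropy(L, pi, P, E) for L in range(1, k_max + 2)]
    return [blocks[k + 1] - blocks[k] for k in range(k_max + 1)]

P = np.array([[0.9, 0.1], [0.3, 0.7]])   # hidden-state transition matrix
pi = np.array([0.75, 0.25])              # stationary distribution of P
E = np.array([[0.8, 0.2], [0.2, 0.8]])   # emission probabilities
hs = kth_order_entropies(7, pi, P, E)
print(hs)
```

The printed values decrease with k, since conditioning on a longer past cannot increase entropy for a stationary process.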

The main idea in the proof goes back to the idea of (conditional)
proportional gambling. A gambler receiving uniform odds with knowledge of
the past k outcomes will have a growth rate of wealth $\log |\mathcal{X}| - H^k$, while
a gambler with knowledge of the infinite past will have a growth rate of
wealth of $\log |\mathcal{X}| - H^\infty$. We don’t know the wealth growth rate of a gambler
with growing knowledge of the past $X_0^n$, but it is certainly sandwiched
between $\log |\mathcal{X}| - H^k$ and $\log |\mathcal{X}| - H^\infty$. But $H^k \searrow H = H^\infty$. Thus, the
sandwich closes and the growth rate must be $\log |\mathcal{X}| - H$.

We will prove the theorem based on lemmas that will follow the proof. 

Theorem 16.8.1 (AEP: Shannon-McMillan-Breiman Theorem) If
H is the entropy rate of a finite-valued stationary ergodic process $\{X_n\}$,
then

$$-\frac{1}{n} \log p(X_0, \ldots, X_{n-1}) \to H \quad \text{with probability 1.} \tag{16.179}$$

Proof: We prove this for a finite alphabet $\mathcal{X}$; this proof and the proof for
countable alphabets and densities are given in Algoet and Cover [20]. We
argue that the sequence of random variables $-\frac{1}{n} \log p(X_0^{n-1})$ is
asymptotically sandwiched between the upper bound $H^k$ and the lower bound
$H^\infty$ for all $k \ge 0$. The AEP will follow since $H^k \to H^\infty$ and $H^\infty = H$. The
kth-order Markov approximation to the probability is defined for $n \ge k$ as

$$p^k(X_0^{n-1}) = p(X_0^{k-1}) \prod_{i=k}^{n-1} p(X_i | X_{i-k}^{i-1}). \tag{16.180}$$

From Lemma 16.8.3 we have

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \le 0, \tag{16.181}$$

which we rewrite, taking the existence of the limit $\frac{1}{n} \log p^k(X_0^n)$ into
account (Lemma 16.8.1), as

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{1}{p(X_0^{n-1})} \le \lim_{n \to \infty} \frac{1}{n} \log \frac{1}{p^k(X_0^{n-1})} = H^k \tag{16.182}$$

for k = 1, 2, .... Also, from Lemma 16.8.3, we have

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p(X_0^{n-1})}{p(X_0^{n-1} | X_{-\infty}^{-1})} \le 0, \tag{16.183}$$

which we rewrite as

$$\liminf_{n \to \infty} \frac{1}{n} \log \frac{1}{p(X_0^{n-1})} \ge \lim_{n \to \infty} \frac{1}{n} \log \frac{1}{p(X_0^{n-1} | X_{-\infty}^{-1})} = H^\infty \tag{16.184}$$

from the definition of $H^\infty$ in Lemma 16.8.1.

Putting together (16.182) and (16.184), we have

$$H^\infty \le \liminf_{n \to \infty} -\frac{1}{n} \log p(X_0^{n-1}) \le \limsup_{n \to \infty} -\frac{1}{n} \log p(X_0^{n-1}) \le H^k \quad \text{for all } k. \tag{16.185}$$

But by Lemma 16.8.2, $H^k \to H^\infty = H$. Consequently,

$$\lim_{n \to \infty} -\frac{1}{n} \log p(X_0^{n-1}) = H. \quad \Box \tag{16.186}$$
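
The statement of the theorem can be watched in simulation. The sketch below draws one long sample path from a hypothetical two-state stationary Markov chain (the transition matrix, path length, and seed are arbitrary choices for illustration) and compares $-\frac{1}{n} \log p(X_0, \ldots, X_{n-1})$ with the entropy rate of the chain.

```python
import math
import numpy as np

def simulate_chain(P, pi, n, rng):
    """Sample X_0, ..., X_{n-1} from a two-state chain started from pi."""
    u = rng.random(n)
    states = np.empty(n, dtype=int)
    states[0] = 0 if u[0] < pi[0] else 1
    for t in range(1, n):
        states[t] = 0 if u[t] < P[states[t - 1], 0] else 1
    return states

def sample_entropy_bits(states, P, pi):
    """-(1/n) log2 p(x_0, ..., x_{n-1}) for the observed path."""
    log_p = math.log2(pi[states[0]])
    for prev, cur in zip(states[:-1], states[1:]):
        log_p += math.log2(P[prev, cur])
    return -log_p / len(states)

def entropy_rate_bits(P, pi):
    """Entropy rate of a stationary Markov chain: sum_i pi_i H(P_i)."""
    return -sum(
        pi[i] * P[i, j] * math.log2(P[i, j]) for i in range(2) for j in range(2)
    )

P = np.array([[0.9, 0.1], [0.3, 0.7]])
pi = np.array([0.75, 0.25])  # stationary distribution of P
rng = np.random.default_rng(0)
path = simulate_chain(P, pi, 100_000, rng)
print(sample_entropy_bits(path, P, pi), entropy_rate_bits(P, pi))
```

A single path is only one realization, so the two printed numbers agree approximately rather than exactly; the gap shrinks as the path length grows.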


We now prove the lemmas that were used in the main proof. The first 
lemma uses the ergodic theorem. 

Lemma 16.8.1 (Markov approximations) For a stationary ergodic
stochastic process $\{X_n\}$,

$$-\frac{1}{n} \log p^k(X_0^{n-1}) \to H^k \quad \text{with probability 1,} \tag{16.187}$$

$$-\frac{1}{n} \log p(X_0^{n-1} | X_{-\infty}^{-1}) \to H^\infty \quad \text{with probability 1.} \tag{16.188}$$

Proof: Functions $Y_n = f(X_{-\infty}^{n})$ of ergodic processes $\{X_i\}$ are ergodic
processes. Thus, $\log p(X_n | X_{n-k}^{n-1})$ and $\log p(X_n | X_{n-1}, X_{n-2}, \ldots)$ are also
ergodic processes, and

$$-\frac{1}{n} \log p^k(X_0^{n-1}) = -\frac{1}{n} \log p(X_0^{k-1}) - \frac{1}{n} \sum_{i=k}^{n-1} \log p(X_i | X_{i-k}^{i-1}) \tag{16.189}$$

$$\to 0 + H^k \quad \text{with probability 1,} \tag{16.190}$$

by the ergodic theorem. Similarly, by the ergodic theorem,

$$-\frac{1}{n} \log p(X_0^{n-1} | X_{-1}, X_{-2}, \ldots) = -\frac{1}{n} \sum_{i=0}^{n-1} \log p(X_i | X_{i-1}, X_{i-2}, \ldots) \tag{16.191}$$

$$\to H^\infty \quad \text{with probability 1.} \quad \Box \tag{16.192}$$


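Lemma 16.8.2 (No gap) $H^k \searrow H^\infty$ and $H = H^\infty$.

Proof: We know that for stationary processes, $H^k \searrow H$, so it remains
to show that $H^k \searrow H^\infty$, thus yielding $H = H^\infty$. Levy’s martingale
convergence theorem for conditional probabilities asserts that

$$p(x_0 | X_{-k}^{-1}) \to p(x_0 | X_{-\infty}^{-1}) \quad \text{with probability 1} \tag{16.193}$$

for all $x_0 \in \mathcal{X}$. Since $\mathcal{X}$ is finite and $p \log p$ is bounded and continuous in
p for all $0 \le p \le 1$, the bounded convergence theorem allows interchange
of expectation and limit, yielding

$$\lim_{k \to \infty} H^k = \lim_{k \to \infty} E\left\{ -\sum_{x_0 \in \mathcal{X}} p(x_0 | X_{-k}^{-1}) \log p(x_0 | X_{-k}^{-1}) \right\} \tag{16.194}$$

$$= E\left\{ -\sum_{x_0 \in \mathcal{X}} p(x_0 | X_{-\infty}^{-1}) \log p(x_0 | X_{-\infty}^{-1}) \right\} \tag{16.195}$$

$$= H^\infty. \tag{16.196}$$

Thus, $H^k \searrow H = H^\infty$. □

Lemma 16.8.3 (Sandwich)

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \le 0, \tag{16.197}$$

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p(X_0^{n-1})}{p(X_0^{n-1} | X_{-\infty}^{-1})} \le 0. \tag{16.198}$$

Proof: Let A be the support set of $p(X_0^{n-1})$. Then

$$E\left\{ \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \right\} = \sum_{x_0^{n-1} \in A} p(x_0^{n-1}) \frac{p^k(x_0^{n-1})}{p(x_0^{n-1})} \tag{16.199}$$

$$= \sum_{x_0^{n-1} \in A} p^k(x_0^{n-1}) \tag{16.200}$$

$$= p^k(A) \tag{16.201}$$

$$\le 1. \tag{16.202}$$

Similarly, let $B(X_{-\infty}^{-1})$ denote the support set of $p(\cdot | X_{-\infty}^{-1})$. Then we have

$$E\left\{ \frac{p(X_0^{n-1})}{p(X_0^{n-1} | X_{-\infty}^{-1})} \right\} = E\left[ E\left\{ \frac{p(X_0^{n-1})}{p(X_0^{n-1} | X_{-\infty}^{-1})} \,\Big|\, X_{-\infty}^{-1} \right\} \right] \tag{16.203}$$

$$= E\left[ \sum_{x^n \in B(X_{-\infty}^{-1})} \frac{p(x^n)}{p(x^n | X_{-\infty}^{-1})} p(x^n | X_{-\infty}^{-1}) \right] \tag{16.204}$$

$$= E\left[ \sum_{x^n \in B(X_{-\infty}^{-1})} p(x^n) \right] \tag{16.205}$$

$$\le 1. \tag{16.206}$$

By Markov’s inequality and (16.202), we have

$$\Pr\left\{ \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \ge t_n \right\} \le \frac{1}{t_n} \tag{16.207}$$

or

$$\Pr\left\{ \frac{1}{n} \log \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \ge \frac{1}{n} \log t_n \right\} \le \frac{1}{t_n}. \tag{16.208}$$

Letting $t_n = n^2$ and noting that $\sum_{n=1}^{\infty} \frac{1}{n^2} < \infty$, we see by the
Borel-Cantelli lemma that the event

$$\left\{ \frac{1}{n} \log \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \ge \frac{1}{n} \log t_n \right\} \tag{16.209}$$

occurs only finitely often with probability 1. Since $\frac{1}{n} \log n^2 \to 0$, it
follows that

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p^k(X_0^{n-1})}{p(X_0^{n-1})} \le 0 \quad \text{with probability 1.} \tag{16.210}$$

Applying the same arguments using Markov’s inequality to (16.206), we
obtain

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{p(X_0^{n-1})}{p(X_0^{n-1} | X_{-\infty}^{-1})} \le 0 \quad \text{with probability 1,} \tag{16.211}$$

proving the lemma. □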


The arguments used in the proof can be extended to prove the AEP for 
the stock market (Theorem 16.5.3). 


SUMMARY 

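Growth rate. The growth rate of a stock market portfolio $\mathbf{b}$ with
respect to a distribution $F(\mathbf{x})$ is defined as

$$W(\mathbf{b}, F) = \int \log \mathbf{b}^t \mathbf{x} \, dF(\mathbf{x}) = E(\log \mathbf{b}^t \mathbf{x}). \tag{16.212}$$

Log-optimal portfolio. The optimal growth rate with respect to a
distribution $F(\mathbf{x})$ is

$$W^*(F) = \max_{\mathbf{b}} W(\mathbf{b}, F). \tag{16.213}$$

The portfolio $\mathbf{b}^*$ that achieves the maximum of $W(\mathbf{b}, F)$ is called the
log-optimal portfolio.

Concavity. $W(\mathbf{b}, F)$ is concave in $\mathbf{b}$ and linear in F. $W^*(F)$ is convex
in F.

Optimality conditions. The portfolio $\mathbf{b}^*$ is log-optimal if and only if

$$E\left( \frac{X_i}{\mathbf{b}^{*t} \mathbf{X}} \right) \begin{cases} = 1 & \text{if } b_i^* > 0, \\ \le 1 & \text{if } b_i^* = 0. \end{cases} \tag{16.214}$$

Expected ratio optimality. If $S_n^* = \prod_{i=1}^{n} \mathbf{b}^{*t} \mathbf{X}_i$, $S_n = \prod_{i=1}^{n} \mathbf{b}_i^t \mathbf{X}_i$, then

$$E \frac{S_n}{S_n^*} \le 1 \quad \text{if and only if} \quad E \ln \frac{S_n}{S_n^*} \le 0. \tag{16.215}$$

Growth rate (AEP)

$$\frac{1}{n} \log S_n^* \to W^*(F) \quad \text{with probability 1.} \tag{16.216}$$

Asymptotic optimality

$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \le 0 \quad \text{with probability 1.} \tag{16.217}$$

Wrong information. Believing g when f is true loses

$$\Delta W = W(\mathbf{b}_f^*, F) - W(\mathbf{b}_g^*, F) \le D(f \| g). \tag{16.218}$$

Side information Y

$$\Delta W \le I(X; Y). \tag{16.219}$$

Chain rule

$$W^*(X_i | X_1, X_2, \ldots, X_{i-1}) = \max_{\mathbf{b}_i(x_1, x_2, \ldots, x_{i-1})} E \log \mathbf{b}_i^t \mathbf{X}_i \tag{16.220}$$

$$W^*(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} W^*(X_i | X_1, X_2, \ldots, X_{i-1}). \tag{16.221}$$

Growth rate for a stationary market.

$$W_\infty^* = \lim \frac{W^*(X_1, X_2, \ldots, X_n)}{n} \tag{16.222}$$

$$\frac{1}{n} \log S_n^* \to W_\infty^* \quad \text{with probability 1.} \tag{16.223}$$

Competitive optimality of log-optimal portfolios.

$$\Pr(VS \ge U^* S^*) \le \frac{1}{2}. \tag{16.224}$$

Universal portfolio.

$$\max_{\hat{\mathbf{b}}_i(\cdot)} \min_{\mathbf{x}^n, \mathbf{b}} \frac{\prod_{i=1}^{n} \hat{\mathbf{b}}_i^t(\mathbf{x}^{i-1}) \mathbf{x}_i}{\prod_{i=1}^{n} \mathbf{b}^t \mathbf{x}_i} = V_n, \tag{16.225}$$

where

$$V_n = \left[ \sum_{n_1 + \cdots + n_m = n} \binom{n}{n_1, n_2, \ldots, n_m} 2^{-nH(n_1/n, \ldots, n_m/n)} \right]^{-1}. \tag{16.226}$$

For m = 2,

$$V_n \sim \sqrt{2 / \pi n}. \tag{16.227}$$

The causal universal portfolio

$$\hat{\mathbf{b}}_{i+1}(\mathbf{x}^i) = \frac{\int \mathbf{b} S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})}{\int S_i(\mathbf{b}, \mathbf{x}^i) \, d\mu(\mathbf{b})} \tag{16.228}$$

achieves

$$\frac{\hat{S}_n(\mathbf{x}^n)}{S_n^*(\mathbf{x}^n)} \ge \frac{1}{2\sqrt{n+1}} \tag{16.229}$$

for all n and all $\mathbf{x}^n$.

AEP. If $\{X_i\}$ is stationary ergodic, then

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(\mathcal{X}) \quad \text{with probability 1.} \tag{16.230}$$
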
(16.230) 
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PROBLEMS 

16.1 Growth rate. Let 


X = 


(1, a) with probability - 

(1, l/a) with probability ^ 


where a > l. This vector X represents a stock market vector of 
cash vs. a hot stock. Let 


W(b, F) = ElogWX 
and 

= max W(b, F) 
b 

be the growth rate. 

(a) Find the log optimal portfolio b*. 

(b) Find the growth rate W*. 

(c) Find the asymptotic behavior of 

n 

s^Ylb% 

i=l 

for all b. 

16.2 Side information. Suppose, in Problem 16.1, that

$Y = \begin{cases} 1 & \text{if } (X_1, X_2) \ge (1, 1), \\ 0 & \text{if } (X_1, X_2) \le (1, 1). \end{cases}$

Let the portfolio b depend on Y. Find the new growth rate W**
and verify that $\Delta W = W^{**} - W^*$ satisfies

$\Delta W \le I(X; Y)$.

16.3 Stock dominance. Consider a stock market vector

$X = (X_1, X_2)$.

Suppose that $X_1 = 2$ with probability 1. Thus an investment in
the first stock is doubled at the end of the day.

(a) Find necessary and sufficient conditions on the distribution of
stock $X_2$ such that the log-optimal portfolio b* invests all the
wealth in stock $X_2$ [i.e., b* = (0, 1)].

(b) Argue for any distribution on $X_2$ that the growth rate satisfies
$W^* \ge 1$.

16.4 Including experts and mutual funds. Let $X \sim F(x)$, $x \in \mathcal{R}_+^m$, be
the vector of price relatives for a stock market. Suppose that an
"expert" suggests a portfolio b. This would result in a wealth
factor $b^t X$. We add this to the stock alternatives to form
$\tilde{X} = (X_1, X_2, \ldots, X_m, b^t X)$. Show that the new growth rate,

$\tilde{W}^* = \max_{b_1, \ldots, b_m, b_{m+1}} \int \ln(b^t \tilde{x})\, dF(\tilde{x})$, (16.231)

is equal to the old growth rate,

$W^* = \max_{b_1, \ldots, b_m} \int \ln(b^t x)\, dF(x)$. (16.232)

16.5 Growth rate for symmetric distribution. Consider a stock vector
$X \sim F(x)$, $X \in \mathcal{R}^m$, $X \ge 0$, where the component stocks
are exchangeable. Thus, $F(x_1, x_2, \ldots, x_m) = F(x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(m)})$
for all permutations $\sigma$.

(a) Find the portfolio b* optimizing the growth rate and establish
its optimality. Now assume that X has been normalized so
that $\frac{1}{m} \sum_{i=1}^m X_i = 1$, and F is symmetric as before.

(b) Again assuming X to be normalized, show that all symmetric
distributions F have the same growth rate against b*.

(c) Find this growth rate.

16.6 Convexity. We are interested in the set of stock market densities
that yield the same optimal portfolio. Let $P_{b_0}$ be the set of all
probability densities on $\mathcal{R}_+^m$ for which $b_0$ is optimal. Thus,
$P_{b_0} = \{p(x) : \int \ln(b^t x)\, p(x)\, dx \text{ is maximized by } b = b_0\}$. Show that
$P_{b_0}$ is a convex set. It may be helpful to use Theorem 16.2.2.

16.7 Short selling. Let

$X = \begin{cases} (1, 2), & p, \\ (1, \frac{1}{2}), & 1 - p. \end{cases}$

Let $B = \{(b_1, b_2) : b_1 + b_2 = 1\}$. Thus, this set of portfolios B
does not include the constraint $b_i \ge 0$. (This allows short selling.)

(a) Find the log optimal portfolio b*(p).

(b) Relate the growth rate W*(p) to the entropy rate H(p).

16.8 Normalizing x. Suppose that we define the log-optimal portfolio
b* to be the portfolio maximizing the relative growth rate

$\int \ln \frac{b^t x}{\frac{1}{m} \sum_{i=1}^m x_i}\, dF(x_1, \ldots, x_m)$.

The virtue of the normalization $\frac{1}{m} \sum_{i=1}^m X_i$, which can be viewed as
the wealth associated with a uniform portfolio, is that the relative
growth rate is finite even when the growth rate $\int \ln b^t x\, dF(x)$
is not. This matters, for example, if X has a St. Petersburg-like
distribution. Thus, the log-optimal portfolio b* is defined for all
distributions F, even those with infinite growth rates W*(F).

(a) Show that if b maximizes $\int \ln(b^t x)\, dF(x)$, it also maximizes
$\int \ln \frac{b^t x}{u^t x}\, dF(x)$, where $u = (\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m})$.

(b) Find the log optimal portfolio b* for

$X = \begin{cases} (2^{2^k+1}, 2^{2^k}), & 2^{-(k+1)}, \\ (2^{2^k}, 2^{2^k+1}), & 2^{-(k+1)}, \end{cases}$

where k = 1, 2, ....

(c) Find EX and W*.

(d) Argue that b* is competitively better than any portfolio b in
the sense that $\Pr\{b^t X > c\, b^{*t} X\} \le \frac{1}{c}$.

16.9 Universal portfolio. We examine the first n = 2 steps of the
implementation of the universal portfolio in (16.7.2) for $\mu(b)$ uniform
for m = 2 stocks. Let the stock vectors for days 1 and 2 be
$x_1 = (1, \frac{1}{2})$ and $x_2 = (1, 2)$. Let b = (b, 1 - b) denote a portfolio.

(a) Graph $S_2(b) = \prod_{i=1}^2 b^t x_i$, $0 \le b \le 1$.

(b) Calculate $S_2^* = \max_b S_2(b)$.

(c) Argue that $\log S_2(b)$ is concave in b.

(d) Calculate the (universal) wealth $\hat{S}_2 = \int_0^1 S_2(b)\, db$.

(e) Calculate the universal portfolio at times n = 1 and n = 2:

$\hat{b}_1 = \int_0^1 b\, db$,

$\hat{b}_2(x_1) = \frac{\int_0^1 b\, S_1(b)\, db}{\int_0^1 S_1(b)\, db}$.

(f) Which of $S_2(b)$, $S_2^*$, $\hat{S}_2$, $\hat{b}_2$ are unchanged if we permute the
order of appearance of the stock vector outcomes [i.e., if the
sequence is now (1, 2), (1, $\frac{1}{2}$)]?

16.10 Growth optimal. Let $X_1, X_2 \ge 0$ be price relatives of two
independent stocks. Suppose that $EX_1 > EX_2$. Do you always want
some of $X_1$ in a growth rate optimal portfolio $S(b) = bX_1 + \bar{b}X_2$?
Prove or provide a counterexample.

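16.11 Cost of universality. In the discussion of finite-horizon universal
portfolios, it was shown that the loss factor due to universality is

$\frac{1}{V_n} = \sum_{k=0}^n \binom{n}{k} \left(\frac{k}{n}\right)^k \left(\frac{n-k}{n}\right)^{n-k}$. (16.233)

Evaluate $V_n$ for n = 1, 2, 3.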

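16.12 Convex families. This problem generalizes Theorem 16.2.2. We
say that $\mathcal{S}$ is a convex family of random variables if $S_1, S_2 \in \mathcal{S}$
implies that $\lambda S_1 + (1 - \lambda) S_2 \in \mathcal{S}$. Let $\mathcal{S}$ be a closed convex family
of random variables. Show that there is a random variable $S^* \in \mathcal{S}$
such that

$E \ln \frac{S}{S^*} \le 0$ (16.234)

for all $S \in \mathcal{S}$ if and only if

$E \frac{S}{S^*} \le 1$ (16.235)

for all $S \in \mathcal{S}$.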

HISTORICAL NOTES 

There is an extensive literature on the mean-variance approach to
investment in the stock market. A good introduction is the book by Sharpe
[491]. Log-optimal portfolios were introduced by Kelly [308] and Latané
[346], and generalized by Breiman [75]. The bound on the increase in the
growth rate in terms of the mutual information is due to Barron and Cover
[31]. See Samuelson [453, 454] for a criticism of log-optimal investment.

The proof of the competitive optimality of the log-optimal portfolio
is due to Bell and Cover [39, 40]. Breiman [75] investigated asymptotic
optimality for random market processes.

The AEP was introduced by Shannon. The AEP for the stock market
and the asymptotic optimality of log-optimal investment are given
in Algoet and Cover [21]. The relatively simple sandwich proof for the
AEP is due to Algoet and Cover [20]. The AEP for real-valued ergodic
processes was proved in full generality by Barron [34] and Orey [402].

The universal portfolio was defined in Cover [110] and the proof of
universality was given in Cover [110] and more exactly in Cover and
Ordentlich [135]. The fixed-horizon exact calculation of the cost of
universality $V_n$ is given in Ordentlich and Cover [401]. The quantity $V_n$ also
appears in data compression in the work of Shtarkov [496].


CHAPTER 17 

INEQUALITIES IN 
INFORMATION THEORY 


This chapter summarizes and reorganizes the inequalities found throughout 
this book. A number of new inequalities on the entropy rates of subsets 
and the relationship of entropy and L p norms are also developed. The 
intimate relationship between Fisher information and entropy is explored, 
culminating in a common proof of the entropy power inequality and the 
Brunn-Minkowski inequality. We also explore the parallels between the 
inequalities in information theory and inequalities in other branches of 
mathematics, such as matrix theory and probability theory. 


17.1 BASIC INEQUALITIES OF INFORMATION THEORY 

Many of the basic inequalities of information theory follow directly from 
convexity. 

Definition A function f is said to be convex if

$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$ (17.1)

for all $0 \le \lambda \le 1$ and all $x_1$ and $x_2$.

Theorem 17.1.1 (Theorem 2.6.2: Jensen's inequality) If f is convex,
then

$f(EX) \le E f(X)$. (17.2)

Lemma 17.1.1 The function $\log x$ is concave and $x \log x$ is convex, for
$0 < x < \infty$.

Theorem 17.1.2 (Theorem 2.7.1: Log sum inequality) For positive
numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,

$\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^n a_i \right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$ (17.3)

with equality iff $a_i / b_i$ = constant.
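A quick numerical check of (17.3) may make the equality condition concrete. This is a minimal sketch; the helper name `log_sum_gap` and the sample vectors are chosen here for illustration and do not come from the text.

```python
import math

def log_sum_gap(a, b):
    """Left side minus right side of the log sum inequality (base-2 logs)."""
    lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * math.log2(sum(a) / sum(b))
    return lhs - rhs

# Generic vectors: the gap should be strictly positive.
gap_generic = log_sum_gap([1.0, 2.0, 3.0], [2.0, 1.0, 4.0])
# Constant ratio a_i / b_i = 1/2: the gap should vanish.
gap_equal_ratio = log_sum_gap([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(gap_generic, gap_equal_ratio)
```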

We recall the following properties of entropy from Section 2.1. 

Definition The entropy H(X) of a discrete random variable X is
defined by

$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$. (17.4)

Theorem 17.1.3 (Lemma 2.1.1, Theorem 2.6.4: Entropy bound)

$0 \le H(X) \le \log |\mathcal{X}|$. (17.5)

Theorem 17.1.4 (Theorem 2.6.5: Conditioning reduces entropy) For
any two random variables X and Y,

$H(X|Y) \le H(X)$, (17.6)

with equality iff X and Y are independent.

Theorem 17.1.5 (Theorem 2.5.1 with Theorem 2.6.6: Chain rule)

$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, \ldots, X_1) \le \sum_{i=1}^n H(X_i)$, (17.7)

with equality iff $X_1, X_2, \ldots, X_n$ are independent.

Theorem 17.1.6 (Theorem 2.7.3) H(p) is a concave function of p.


We now state some properties of relative entropy and mutual
information (Section 2.3).

Definition The relative entropy or Kullback-Leibler distance between
two probability mass functions p(x) and q(x) is defined by

$D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$. (17.8)

Definition The mutual information between two random variables X
and Y is defined by

$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = D(p(x, y)\|p(x)p(y))$. (17.9)

The following basic information inequality can be used to prove many
of the other inequalities in this chapter.

Theorem 17.1.7 (Theorem 2.6.3: Information inequality) For any
two probability mass functions p and q,

$D(p\|q) \ge 0$ (17.10)

with equality iff p(x) = q(x) for all $x \in \mathcal{X}$.

Corollary For any two random variables X and Y,

$I(X; Y) = D(p(x, y)\|p(x)p(y)) \ge 0$ (17.11)

with equality iff p(x, y) = p(x)p(y) (i.e., X and Y are independent).

Theorem 17.1.8 (Theorem 2.7.2: Convexity of relative entropy)
$D(p\|q)$ is convex in the pair (p, q).

Theorem 17.1.9 (Theorem 2.4.1)

$I(X; Y) = H(X) - H(X|Y)$. (17.12)

$I(X; Y) = H(Y) - H(Y|X)$. (17.13)

$I(X; Y) = H(X) + H(Y) - H(X, Y)$. (17.14)

$I(X; X) = H(X)$. (17.15)
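The identities (17.12) to (17.14) can be confirmed against the definition (17.9) on a small joint distribution. The joint pmf and the helper names below are hypothetical, chosen only to illustrate the computation.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y) on {0, 1} x {0, 1}.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}
p_x = [sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)]

def cond_entropy(marginal, pick):
    """H(U|V) = sum_v p(v) H(U | V = v); pick(u, v) returns the joint key."""
    return sum(
        marginal[v] * entropy([joint[pick(u, v)] / marginal[v] for u in (0, 1)])
        for v in (0, 1) if marginal[v] > 0
    )

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(joint.values())
H_X_given_Y = cond_entropy(p_y, lambda x, y: (x, y))
H_Y_given_X = cond_entropy(p_x, lambda y, x: (x, y))

# Mutual information straight from definition (17.9).
I_def = sum(
    p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)
print(I_def, H_X - H_X_given_Y, H_Y - H_Y_given_X, H_X + H_Y - H_XY)
```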


Theorem 17.1.10 (Section 4.4) For a Markov chain:

1. Relative entropy $D(\mu_n \| \mu_n')$ decreases with time.

2. Relative entropy $D(\mu_n \| \mu)$ between a distribution and the stationary
distribution decreases with time.

3. Entropy $H(X_n)$ increases if the stationary distribution is uniform.

4. The conditional entropy $H(X_n | X_1)$ increases with time for a
stationary Markov chain.
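Property 2 can be observed on a two-state chain. The transition matrix, the starting distribution, and the helper names below are assumptions made for this sketch; the stationary distribution (0.75, 0.25) was worked out by hand for this particular matrix.

```python
import math

def step(mu, P):
    """One step of the chain: mu_{n+1}(j) = sum_i mu_n(i) P[i][j]."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

def D(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [[0.9, 0.1], [0.3, 0.7]]   # hypothetical transition matrix
stationary = [0.75, 0.25]      # satisfies stationary = stationary * P
mu = [0.1, 0.9]                # arbitrary initial distribution

divergences = []
for _ in range(6):
    divergences.append(D(mu, stationary))
    mu = step(mu, P)
print(divergences)
```

The printed sequence should be nonincreasing, as the theorem asserts.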
Theorem 17.1.11 Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim p(x)$. Let $\hat{p}_n$ be the
empirical probability mass function of $X_1, X_2, \ldots, X_n$. Then

$E D(\hat{p}_n \| p) \le E D(\hat{p}_{n-1} \| p)$. (17.16)

17.2 DIFFERENTIAL ENTROPY 

We now review some of the basic properties of differential entropy 
(Section 8.1). 

Definition The differential entropy $h(X_1, X_2, \ldots, X_n)$, sometimes
written h(f), is defined by

$h(X_1, X_2, \ldots, X_n) = -\int f(x) \log f(x)\, dx$. (17.17)

The differential entropy for many common densities is given in
Table 17.1.

Definition The relative entropy between probability densities f and
g is

$D(f\|g) = \int f(x) \log (f(x)/g(x))\, dx$. (17.18)

The properties of the continuous version of relative entropy are
identical to the discrete version. Differential entropy, on the other hand, has
some properties that differ from those of discrete entropy. For example,
differential entropy may be negative.

We now restate some of the theorems that continue to hold for
differential entropy.

Theorem 17.2.1 (Theorem 8.6.1: Conditioning reduces entropy)
$h(X|Y) \le h(X)$, with equality iff X and Y are independent.

Theorem 17.2.2 (Theorem 8.6.2: Chain rule)

$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n h(X_i | X_{i-1}, X_{i-2}, \ldots, X_1) \le \sum_{i=1}^n h(X_i)$ (17.19)

with equality iff $X_1, X_2, \ldots, X_n$ are independent.

Lemma 17.2.1 If X and Y are independent, then $h(X + Y) \ge h(X)$.

Proof: $h(X + Y) \ge h(X + Y | Y) = h(X|Y) = h(X)$. □




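TABLE 17.1 Differential Entropies (a)

Name | Density | Entropy (nats)

Beta | $f(x) = \frac{x^{p-1}(1-x)^{q-1}}{B(p,q)}$, $0 \le x \le 1$, $p, q > 0$ | $\ln B(p,q) - (p-1)[\psi(p) - \psi(p+q)] - (q-1)[\psi(q) - \psi(p+q)]$

Cauchy | $f(x) = \frac{\lambda}{\pi} \frac{1}{\lambda^2 + x^2}$, $-\infty < x < \infty$, $\lambda > 0$ | $\ln(4\pi\lambda)$

Chi | $f(x) = \frac{2}{2^{n/2} \sigma^n \Gamma(n/2)} x^{n-1} e^{-x^2/(2\sigma^2)}$, $x > 0$, $n > 0$ | $\ln \frac{\sigma\Gamma(n/2)}{\sqrt{2}} - \frac{n-1}{2} \psi\left(\frac{n}{2}\right) + \frac{n}{2}$

Chi-squared | $f(x) = \frac{1}{2^{n/2} \sigma^n \Gamma(n/2)} x^{n/2-1} e^{-x/(2\sigma^2)}$, $x > 0$, $n > 0$ | $\ln 2\sigma^2\Gamma\left(\frac{n}{2}\right) + \left(1 - \frac{n}{2}\right) \psi\left(\frac{n}{2}\right) + \frac{n}{2}$

Erlang | $f(x) = \frac{\beta^n}{(n-1)!} x^{n-1} e^{-\beta x}$, $x, \beta > 0$, $n > 0$ | $(1-n)\psi(n) + \ln \frac{\Gamma(n)}{\beta} + n$

Exponential | $f(x) = \frac{1}{\lambda} e^{-x/\lambda}$, $x, \lambda > 0$ | $1 + \ln \lambda$

F | $f(x) = \frac{n_1^{n_1/2} n_2^{n_2/2}}{B(n_1/2, n_2/2)} \frac{x^{n_1/2 - 1}}{(n_2 + n_1 x)^{(n_1+n_2)/2}}$, $x > 0$, $n_1, n_2 > 0$ | $\ln \frac{n_2}{n_1} B\left(\frac{n_1}{2}, \frac{n_2}{2}\right) + \left(1 - \frac{n_1}{2}\right) \psi\left(\frac{n_1}{2}\right) - \left(1 + \frac{n_2}{2}\right) \psi\left(\frac{n_2}{2}\right) + \frac{n_1 + n_2}{2} \psi\left(\frac{n_1 + n_2}{2}\right)$

Gamma | $f(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}$, $x, \alpha, \beta > 0$ | $\ln(\beta\Gamma(\alpha)) + (1 - \alpha)\psi(\alpha) + \alpha$

Laplace | $f(x) = \frac{1}{2\lambda} e^{-|x - \theta|/\lambda}$, $-\infty < x, \theta < \infty$, $\lambda > 0$ | $1 + \ln 2\lambda$

Logistic | $f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$, $-\infty < x < \infty$ | $2$

Lognormal | $f(x) = \frac{1}{\sigma x \sqrt{2\pi}} e^{-(\ln x - m)^2/(2\sigma^2)}$, $x > 0$, $-\infty < m < \infty$, $\sigma > 0$ | $m + \frac{1}{2} \ln(2\pi e \sigma^2)$

Maxwell-Boltzmann | $f(x) = 4\pi^{-1/2} \beta^{3/2} x^2 e^{-\beta x^2}$, $x, \beta > 0$ | $\frac{1}{2} \ln \frac{\pi}{\beta} + \gamma - \frac{1}{2}$

Normal | $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x - \mu)^2/(2\sigma^2)}$, $-\infty < x, \mu < \infty$, $\sigma > 0$ | $\frac{1}{2} \ln(2\pi e \sigma^2)$

Generalized normal | $f(x) = \frac{2\beta^{\alpha/2}}{\Gamma(\alpha/2)} x^{\alpha-1} e^{-\beta x^2}$, $x, \alpha, \beta > 0$ | $\ln \frac{\Gamma(\alpha/2)}{2\beta^{1/2}} - \frac{\alpha - 1}{2} \psi\left(\frac{\alpha}{2}\right) + \frac{\alpha}{2}$

Pareto | $f(x) = \frac{a k^a}{x^{a+1}}$, $x \ge k > 0$, $a > 0$ | $\ln \frac{k}{a} + 1 + \frac{1}{a}$

Rayleigh | $f(x) = \frac{x}{b^2} e^{-x^2/(2b^2)}$, $x, b > 0$ | $1 + \ln \frac{b}{\sqrt{2}} + \frac{\gamma}{2}$

Student's t | $f(x) = \frac{(1 + x^2/n)^{-(n+1)/2}}{\sqrt{n}\, B(\frac{1}{2}, \frac{n}{2})}$, $-\infty < x < \infty$, $n > 0$ | $\frac{n+1}{2} \left[ \psi\left(\frac{n+1}{2}\right) - \psi\left(\frac{n}{2}\right) \right] + \ln \sqrt{n}\, B\left(\frac{1}{2}, \frac{n}{2}\right)$

Triangular | $f(x) = \frac{2x}{a}$ for $0 \le x \le a$; $f(x) = \frac{2(1-x)}{1-a}$ for $a \le x \le 1$ | $\frac{1}{2} - \ln 2$

Uniform | $f(x) = \frac{1}{\beta - \alpha}$, $\alpha \le x \le \beta$ | $\ln(\beta - \alpha)$

Weibull | $f(x) = \frac{c}{\alpha} x^{c-1} e^{-x^c/\alpha}$, $x, c, \alpha > 0$ | $\frac{(c-1)\gamma}{c} + \ln \frac{\alpha^{1/c}}{c} + 1$

(a) All entropies are in nats; $\Gamma(z) = \int_0^\infty e^{-t} t^{z-1}\, dt$; $\psi(z) = \frac{d}{dz} \ln \Gamma(z)$; $\gamma$ = Euler's constant = 0.57721566....

Source: Lazo and Rathie [543].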
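Individual table entries can be spot-checked by numerical integration of $-\int f \ln f$. The following is a rough sketch under stated assumptions: the midpoint rule, the truncation limits, the parameter value λ = 2, and the helper name `differential_entropy` are choices made here for illustration, and location θ = 0 is assumed for the Laplace density.

```python
import math

def differential_entropy(f, lo, hi, steps=200_000):
    """Midpoint-rule estimate of -integral of f ln f over [lo, hi], in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = f(lo + (i + 0.5) * dx)
        if fx > 0.0:
            total -= fx * math.log(fx) * dx
    return total

lam = 2.0  # hypothetical scale parameter

# Exponential: table entry 1 + ln(lam).
h_exponential = differential_entropy(lambda x: math.exp(-x / lam) / lam, 0.0, 80.0)
# Laplace with theta = 0: table entry 1 + ln(2 lam).
h_laplace = differential_entropy(
    lambda x: math.exp(-abs(x) / lam) / (2 * lam), -80.0, 80.0
)
# Logistic: table entry 2.
h_logistic = differential_entropy(
    lambda x: math.exp(-x) / (1 + math.exp(-x)) ** 2, -40.0, 40.0
)
print(h_exponential, h_laplace, h_logistic)
```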
Theorem 17.2.3 (Theorem 8.6.5) Let the random vector $X \in \mathcal{R}^n$ have
zero mean and covariance $K = EXX^t$ (i.e., $K_{ij} = EX_iX_j$, $1 \le i, j \le n$).
Then

$h(X) \le \frac{1}{2} \log (2\pi e)^n |K|$ (17.20)

with equality iff $X \sim \mathcal{N}(0, K)$.


17.3 BOUNDS ON ENTROPY AND RELATIVE ENTROPY 


In this section we revisit some of the bounds on the entropy function. The 
most useful is Fano’s inequality, which is used to bound away from zero 
the probability of error of the best decoder for a communication channel 
at rates above capacity. 

Theorem 17.3.1 (Theorem 2.10.1: Fano's inequality) Given two
random variables X and Y, let $\hat{X} = g(Y)$ be any estimator of X given Y and
let $P_e = \Pr(\hat{X} \ne X)$ be the probability of error. Then

$H(P_e) + P_e \log |\mathcal{X}| \ge H(X|\hat{X}) \ge H(X|Y)$. (17.21)

Consequently, if $H(X|Y) > 0$, then $P_e > 0$.
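Fano's inequality can be checked on a small example. The joint distribution below (a uniform input through a symmetric noisy channel) and the maximum a posteriori estimator are assumptions made for this sketch, not part of the original statement.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

alphabet = [0, 1, 2]
# Hypothetical joint pmf: X uniform, Y = X with probability 0.8.
joint = {(x, y): (0.8 if x == y else 0.1) / 3 for x in alphabet for y in alphabet}
p_y = {y: sum(joint[(x, y)] for x in alphabet) for y in alphabet}

# H(X|Y) = sum_y p(y) H(X | Y = y).
H_X_given_Y = sum(
    p_y[y] * entropy([joint[(x, y)] / p_y[y] for x in alphabet]) for y in alphabet
)

# Estimator g(y): the most likely x given y.
estimate = {y: max(alphabet, key=lambda x: joint[(x, y)]) for y in alphabet}
P_e = sum(p for (x, y), p in joint.items() if estimate[y] != x)

fano_bound = entropy([P_e, 1 - P_e]) + P_e * math.log2(len(alphabet))
print(P_e, H_X_given_Y, fano_bound)
```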

A similar result is given in the following lemma. 

Lemma 17.3.1 (Lemma 2.10.1) If X and X' are i.i.d. with entropy
H(X),

$\Pr(X = X') \ge 2^{-H(X)}$ (17.22)

with equality if and only if X has a uniform distribution.

The continuous analog of Fano's inequality bounds the mean-squared
error of an estimator.

Theorem 17.3.2 (Theorem 8.6.6) Let X be a random variable with
differential entropy h(X). Let $\hat{X}$ be an estimate of X, and let $E(X - \hat{X})^2$
be the expected prediction error. Then

$E(X - \hat{X})^2 \ge \frac{1}{2\pi e} e^{2h(X)}$. (17.23)

Given side information Y and estimator $\hat{X}(Y)$,

$E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2h(X|Y)}$. (17.24)
Theorem 17.3.3 ($\mathcal{L}_1$ bound on entropy) Let p and q be two
probability mass functions on $\mathcal{X}$ such that

$\|p - q\|_1 = \sum_{x \in \mathcal{X}} |p(x) - q(x)| \le \frac{1}{2}$. (17.25)

Then

$|H(p) - H(q)| \le -\|p - q\|_1 \log \frac{\|p - q\|_1}{|\mathcal{X}|}$. (17.26)

Proof: Consider the function $f(t) = -t \log t$ shown in Figure 17.1. It
can be verified by differentiation that the function $f(\cdot)$ is concave. Also,
f(0) = f(1) = 0. Hence the function is positive between 0 and 1.
Consider the chord of the function from t to $t + \nu$ (where $\nu \le \frac{1}{2}$). The
maximum absolute slope of the chord is at either end (when t = 0 or
$1 - \nu$). Hence for $0 \le t \le 1 - \nu$, we have

$|f(t) - f(t + \nu)| \le \max\{f(\nu), f(1 - \nu)\} = -\nu \log \nu$. (17.27)

FIGURE 17.1. Function $f(t) = -t \ln t$.

Let $r(x) = |p(x) - q(x)|$. Then

$|H(p) - H(q)| = \left| \sum_{x \in \mathcal{X}} (-p(x) \log p(x) + q(x) \log q(x)) \right|$ (17.28)

$\le \sum_{x \in \mathcal{X}} |-p(x) \log p(x) + q(x) \log q(x)|$ (17.29)

$\le \sum_{x \in \mathcal{X}} -r(x) \log r(x)$ (17.30)

$= \|p - q\|_1 \sum_{x \in \mathcal{X}} -\frac{r(x)}{\|p - q\|_1} \log \left( \frac{r(x)}{\|p - q\|_1} \|p - q\|_1 \right)$ (17.31)

$= -\|p - q\|_1 \log \|p - q\|_1 + \|p - q\|_1 H\left( \frac{r(x)}{\|p - q\|_1} \right)$ (17.32)

$\le -\|p - q\|_1 \log \|p - q\|_1 + \|p - q\|_1 \log |\mathcal{X}|$, (17.33)

where (17.30) follows from (17.27).

□
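A small numerical instance of the bound (17.26) is sketched below. The two distributions are hypothetical; they were picked so that the condition (17.25) holds.

```python
import math

def entropy(p):
    """Entropy in bits."""
    return -sum(v * math.log2(v) for v in p if v > 0)

p = [0.5, 0.3, 0.2]   # hypothetical pmf
q = [0.4, 0.4, 0.2]   # hypothetical pmf close to p

l1 = sum(abs(a - b) for a, b in zip(p, q))   # ||p - q||_1
gap = abs(entropy(p) - entropy(q))           # |H(p) - H(q)|
bound = -l1 * math.log2(l1 / len(p))         # right side of (17.26)
print(l1, gap, bound)
```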


Finally, relative entropy is stronger than the $\mathcal{L}_1$ norm in the following
sense:

Lemma 17.3.2 (Lemma 11.6.1)

$D(p_1 \| p_2) \ge \frac{1}{2 \ln 2} \|p_1 - p_2\|_1^2$. (17.34)

The relative entropy between two probability mass functions P(x) and
Q(x) is zero when P = Q. Around this point, the relative entropy has
a quadratic behavior, and the first term in the Taylor series expansion of
the relative entropy $D(P\|Q)$ around the point P = Q is the chi-squared
distance between the distributions P and Q. Let

$\chi^2(P, Q) = \sum_x \frac{(P(x) - Q(x))^2}{Q(x)}$. (17.35)

Lemma 17.3.3 For P near Q,

$D(P\|Q) = \frac{1}{2} \chi^2 + \cdots$. (17.36)

Proof: See Problem 11.2.

□
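Both lemmas can be illustrated on a pair of nearby distributions. In this sketch the distributions are hypothetical, and the chi-squared approximation (17.36) is evaluated with natural logarithms, which is an assumption about the intended base; the bound (17.34) is evaluated in bits.

```python
import math

Q = [0.5, 0.3, 0.2]     # hypothetical reference pmf
P = [0.51, 0.29, 0.2]   # small perturbation of Q

d_nats = sum(p * math.log(p / q) for p, q in zip(P, Q))   # D(P||Q) in nats
chi2 = sum((p - q) ** 2 / q for p, q in zip(P, Q))        # chi-squared distance
l1 = sum(abs(p - q) for p, q in zip(P, Q))                # ||P - Q||_1

d_bits = d_nats / math.log(2)
pinsker = l1 ** 2 / (2 * math.log(2))                     # right side of (17.34)
print(d_nats, chi2 / 2, d_bits, pinsker)
```

The first two printed values should nearly coincide, and the third should be at least as large as the fourth.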


17.4 INEQUALITIES FOR TYPES 

The method of types is a powerful tool for proving results in large
deviation theory and error exponents. We repeat the basic theorems.

Theorem 17.4.1 (Theorem 11.1.1) The number of types with
denominator n is bounded by

$|\mathcal{P}_n| \le (n + 1)^{|\mathcal{X}|}$. (17.37)

Theorem 17.4.2 (Theorem 11.1.2) If $X_1, X_2, \ldots, X_n$ are drawn i.i.d.
according to Q(x), the probability of $x^n$ depends only on its type and is
given by

$Q^n(x^n) = 2^{-n(H(P_{x^n}) + D(P_{x^n} \| Q))}$. (17.38)

Theorem 17.4.3 (Theorem 11.1.3: Size of a type class T(P)) For any
type $P \in \mathcal{P}_n$,

$\frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{nH(P)} \le |T(P)| \le 2^{nH(P)}$. (17.39)

Theorem 17.4.4 (Theorem 11.1.4) For any $P \in \mathcal{P}_n$ and any
distribution Q, the probability of the type class T(P) under $Q^n$ is $2^{-nD(P\|Q)}$ to
first order in the exponent. More precisely,

$\frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{-nD(P\|Q)} \le Q^n(T(P)) \le 2^{-nD(P\|Q)}$. (17.40)
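The bounds (17.39) on the size of a type class can be evaluated directly, since |T(P)| is a multinomial coefficient. The type used below, with symbol counts (3, 2, 1) on a three-letter alphabet, is a hypothetical example.

```python
import math

counts = (3, 2, 1)            # hypothetical type: symbol counts in x^n
n = sum(counts)
alphabet_size = len(counts)

# |T(P)| is the multinomial coefficient n! / (n_1! n_2! ... n_m!).
type_class_size = math.factorial(n)
for c in counts:
    type_class_size //= math.factorial(c)

H = -sum((c / n) * math.log2(c / n) for c in counts)   # H(P) in bits
upper = 2 ** (n * H)
lower = upper / (n + 1) ** alphabet_size
print(lower, type_class_size, upper)
```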


17.5 COMBINATORIAL BOUNDS ON ENTROPY 

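We give tight bounds on the size of $\binom{n}{k}$ when k is not 0 or n using the
result of Wozencraft and Reiffen [568]:

Lemma 17.5.1 For $0 < p < 1$, $q = 1 - p$, such that np is an integer,

$\frac{1}{\sqrt{8npq}} \le \binom{n}{np} 2^{-nH(p)} \le \frac{1}{\sqrt{\pi npq}}$. (17.41)

Proof: We begin with a strong form of Stirling's approximation [208],
which states that

$\sqrt{2\pi n} \left(\frac{n}{e}\right)^n \le n! \le \sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{\frac{1}{12n}}$. (17.42)

Applying this to find an upper bound, we obtain

$\binom{n}{np} \le \frac{\sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{\frac{1}{12n}}}{\sqrt{2\pi np} \left(\frac{np}{e}\right)^{np} \sqrt{2\pi nq} \left(\frac{nq}{e}\right)^{nq}}$ (17.43)

$= \frac{1}{\sqrt{2\pi npq}} \frac{1}{p^{np} q^{nq}} e^{\frac{1}{12n}}$ (17.44)

$< \frac{1}{\sqrt{\pi npq}} 2^{nH(p)}$, (17.45)

since $e^{\frac{1}{12n}} \le e^{\frac{1}{12}} = 1.087 < \sqrt{2}$, hence proving the upper bound.

The lower bound is obtained similarly. Using Stirling's formula, we
obtain

$\binom{n}{np} \ge \frac{\sqrt{2\pi n} \left(\frac{n}{e}\right)^n e^{-\left(\frac{1}{12np} + \frac{1}{12nq}\right)}}{\sqrt{2\pi np} \left(\frac{np}{e}\right)^{np} \sqrt{2\pi nq} \left(\frac{nq}{e}\right)^{nq}}$ (17.46)

$= \frac{1}{\sqrt{2\pi npq}} \frac{1}{p^{np} q^{nq}} e^{-\left(\frac{1}{12np} + \frac{1}{12nq}\right)}$ (17.47)

$= \frac{1}{\sqrt{2\pi npq}} 2^{nH(p)} e^{-\left(\frac{1}{12np} + \frac{1}{12nq}\right)}$. (17.48)

If $np \ge 1$ and $nq \ge 3$, then

$e^{-\left(\frac{1}{12np} + \frac{1}{12nq}\right)} \ge e^{-\frac{1}{9}} = 0.8948 > \frac{\sqrt{\pi}}{2} = 0.8862$, (17.49)

and the lower bound follows directly from substituting this into the
equation. The exceptions to this condition are the cases where np = 1, nq = 1
or 2, and np = 2, nq = 2 (the case when $np \ge 3$, nq = 1 or 2 can be
handled by flipping the roles of p and q). In each of these cases

np = 1, nq = 1 → n = 2, p = 1/2, $\binom{n}{np}$ = 2, bound = 2

np = 1, nq = 2 → n = 3, p = 1/3, $\binom{n}{np}$ = 3, bound = 2.92

np = 2, nq = 2 → n = 4, p = 1/2, $\binom{n}{np}$ = 6, bound = 5.66.

Thus, even in these special cases, the bound is valid, and hence the lower
bound is valid for all $p \ne 0, 1$. Note that the lower bound blows up when
p = 0 or p = 1, and is therefore not valid. □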
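The lemma lends itself to an exhaustive check for small n. The sketch below is illustrative; the range n ≤ 60, the helper names, and the small relative tolerance (added to absorb floating-point rounding at the equality case n = 2, p = 1/2) are choices made here.

```python
import math

def H2(p):
    """Binary entropy in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bounds_hold(n, k, tol=1e-9):
    """Check (17.41) for p = k/n, allowing a tiny relative tolerance."""
    p = k / n
    q = 1 - p
    scaled = math.comb(n, k) * 2 ** (-n * H2(p))
    lower = 1 / math.sqrt(8 * n * p * q)
    upper = 1 / math.sqrt(math.pi * n * p * q)
    return lower * (1 - tol) <= scaled <= upper * (1 + tol)

# Every n from 2 to 60 and every k with 0 < k < n.
all_ok = all(bounds_hold(n, k) for n in range(2, 61) for k in range(1, n))
print(all_ok)
```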


17.6 ENTROPY RATES OF SUBSETS 

We now generalize the chain rule for differential entropy. The chain rule
provides a bound on the entropy rate of a collection of random variables
in terms of the entropy of each random variable:

$h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_i)$. (17.50)

We extend this to show that the entropy per element of a subset of a set of
random variables decreases as the size of the subset increases. This is not
true for each subset but is true on the average over subsets, as expressed
in Theorem 17.6.1.

Definition Let $(X_1, X_2, \ldots, X_n)$ have a density, and for every
$S \subseteq \{1, 2, \ldots, n\}$, denote by X(S) the subset $\{X_i : i \in S\}$. Let

$h_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S: |S| = k} \frac{h(X(S))}{k}$. (17.51)

Here $h_k^{(n)}$ is the average entropy in bits per symbol of a randomly drawn
k-element subset of $\{X_1, X_2, \ldots, X_n\}$.

The following theorem by Han [270] says that the average entropy
decreases monotonically in the size of the subset.

Theorem 17.6.1

$h_1^{(n)} \ge h_2^{(n)} \ge \cdots \ge h_n^{(n)}$. (17.52)

Proof: We first prove the last inequality, $h_n^{(n)} \le h_{n-1}^{(n)}$. We write

$h(X_1, X_2, \ldots, X_n) = h(X_1, X_2, \ldots, X_{n-1}) + h(X_n | X_1, X_2, \ldots, X_{n-1})$,

$h(X_1, X_2, \ldots, X_n) = h(X_1, X_2, \ldots, X_{n-2}, X_n) + h(X_{n-1} | X_1, X_2, \ldots, X_{n-2}, X_n)$
$\le h(X_1, X_2, \ldots, X_{n-2}, X_n) + h(X_{n-1} | X_1, X_2, \ldots, X_{n-2})$,

$\vdots$

$h(X_1, X_2, \ldots, X_n) \le h(X_2, X_3, \ldots, X_n) + h(X_1)$.

Adding these n inequalities and using the chain rule, we obtain

$n\, h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n) + h(X_1, X_2, \ldots, X_n)$ (17.53)

or

$\frac{1}{n} h(X_1, X_2, \ldots, X_n) \le \frac{1}{n} \sum_{i=1}^n \frac{h(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)}{n - 1}$, (17.54)

which is the desired result $h_n^{(n)} \le h_{n-1}^{(n)}$. We now prove that $h_k^{(n)} \le h_{k-1}^{(n)}$
for all $k \le n$ by first conditioning on a k-element subset, and then taking
a uniform choice over its (k - 1)-element subsets. For each k-element
subset, $h_k^{(k)} \le h_{k-1}^{(k)}$, and hence the inequality remains true after taking
the expectation over all k-element subsets chosen uniformly from the n
elements. □


Theorem 17.6.2 Let r > 0, and define

$t_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S: |S| = k} e^{\frac{r h(X(S))}{k}}$. (17.55)

Then

$t_1^{(n)} \ge t_2^{(n)} \ge \cdots \ge t_n^{(n)}$. (17.56)

Proof: Starting from (17.54), we multiply both sides by r, exponentiate,
and then apply the arithmetic mean geometric mean inequality, to obtain

$e^{\frac{1}{n} r h(X_1, X_2, \ldots, X_n)} \le e^{\frac{1}{n} \sum_{i=1}^n \frac{r h(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)}{n - 1}}$ (17.57)

$\le \frac{1}{n} \sum_{i=1}^n e^{\frac{r h(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)}{n - 1}}$ for all $r \ge 0$, (17.58)

which is equivalent to $t_n^{(n)} \le t_{n-1}^{(n)}$. Now we use the same arguments as
in Theorem 17.6.1, taking an average over all subsets to prove the result
that for all $k \le n$, $t_k^{(n)} \le t_{k-1}^{(n)}$. □

Definition The average conditional entropy rate per element for all
subsets of size k is the average of the above quantities for k-element
subsets of $\{1, 2, \ldots, n\}$:

$g_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S: |S| = k} \frac{h(X(S) | X(S^c))}{k}$. (17.59)

Here $g_k(S)$ is the entropy per element of the set S conditional on the
elements of the set $S^c$. When the size of the set S increases, one can
expect a greater dependence among the elements of the set S, which
explains Theorem 17.6.1.

In the case of the conditional entropy per element, as k increases, the
size of the conditioning set $S^c$ decreases and the entropy of the set S
increases. The increase in entropy per element due to the decrease in
conditioning dominates the decrease due to additional dependence among
the elements, as can be seen from the following theorem due to Han [270].
Note that the conditional entropy ordering in the following theorem is the
reverse of the unconditional entropy ordering in Theorem 17.6.1.

Theorem 17.6.3

$g_1^{(n)} \le g_2^{(n)} \le \cdots \le g_n^{(n)}$. (17.60)

Proof: The proof proceeds on lines very similar to the proof of the
theorem for the unconditional entropy per element for a random subset.
We first prove that $g_n^{(n)} \ge g_{n-1}^{(n)}$ and then use this to prove the rest of
the inequalities. By the chain rule, the entropy of a collection of random
variables is less than the sum of the entropies:

$h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_i)$. (17.61)

Subtracting both sides of this inequality from $n\, h(X_1, X_2, \ldots, X_n)$, we
have

$(n - 1) h(X_1, X_2, \ldots, X_n) \ge \sum_{i=1}^n (h(X_1, X_2, \ldots, X_n) - h(X_i))$ (17.62)

$= \sum_{i=1}^n h(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n | X_i)$. (17.63)

Dividing this by n(n - 1), we obtain

$\frac{h(X_1, X_2, \ldots, X_n)}{n} \ge \frac{1}{n} \sum_{i=1}^n \frac{h(X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n | X_i)}{n - 1}$, (17.64)

which is equivalent to $g_n^{(n)} \ge g_{n-1}^{(n)}$. We now prove that $g_k^{(n)} \ge g_{k-1}^{(n)}$ for
all $k \le n$ by first conditioning on a k-element subset and then taking
a uniform choice over its (k - 1)-element subsets. For each k-element
subset, $g_k^{(k)} \ge g_{k-1}^{(k)}$, and hence the inequality remains true after taking
the expectation over all k-element subsets chosen uniformly from the n
elements. □

Theorem 17.6.4 Let

$f_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{S: |S| = k} \frac{I(X(S); X(S^c))}{k}$. (17.65)

Then

$f_1^{(n)} \ge f_2^{(n)} \ge \cdots \ge f_n^{(n)}$. (17.66)

Proof: The theorem follows from the identity $I(X(S); X(S^c)) =
h(X(S)) - h(X(S) | X(S^c))$ and Theorems 17.6.1 and 17.6.3. □
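The three orderings (17.52), (17.60), and (17.66) can be observed on a jointly Gaussian vector, for which $h(X(S)) = \frac{1}{2} \ln((2\pi e)^{|S|} |K_S|)$ by Theorem 17.2.3. The covariance matrix and helper names below are hypothetical; the conditional entropy is obtained from the chain rule as $h(X(S)|X(S^c)) = h(X_1, \ldots, X_n) - h(X(S^c))$.

```python
import math
from itertools import combinations

# Hypothetical positive definite covariance matrix for (X_1, X_2, X_3).
K = [[2.0, 1.0, 0.5],
     [1.0, 2.0, 1.0],
     [0.5, 1.0, 2.0]]
n = len(K)

def det(m):
    """Determinant by cofactor expansion (adequate for tiny matrices)."""
    if not m:
        return 1.0
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))

def h(S):
    """Differential entropy (nats) of the Gaussian subvector X(S)."""
    sub = [[K[i][j] for j in S] for i in S]
    return 0.5 * math.log((2 * math.pi * math.e) ** len(S) * det(sub))

full = tuple(range(n))
h_avg, g_avg = [], []
for k in range(1, n + 1):
    subsets = list(combinations(full, k))
    # Average entropy per element, as in (17.51).
    h_avg.append(sum(h(S) / k for S in subsets) / len(subsets))
    # Average conditional entropy per element, as in (17.59).
    g_avg.append(sum((h(full) - h(tuple(i for i in full if i not in S))) / k
                     for S in subsets) / len(subsets))
# Average mutual information per element, as in (17.65).
f_avg = [a - b for a, b in zip(h_avg, g_avg)]
print(h_avg, g_avg, f_avg)
```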


17.7 ENTROPY AND FISHER INFORMATION 


The differential entropy of a random variable is a measure of its descriptive 
complexity. The Fisher information is a measure of the minimum error 
in estimating a parameter of a distribution. In this section we derive a 
relationship between these two fundamental quantities and use this to 
derive the entropy power inequality. 

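Let X be any random variable with density f(x). We introduce a
location parameter $\theta$ and write the density in a parametric form as $f(x - \theta)$.
The Fisher information (Section 11.10) with respect to $\theta$ is given by

$J(\theta) = \int_{-\infty}^{\infty} f(x - \theta) \left[ \frac{\partial}{\partial\theta} \ln f(x - \theta) \right]^2 dx$. (17.67)

In this case, differentiation with respect to x is equivalent to differentiation
with respect to $\theta$. So we can write the Fisher information as

$J(X) = \int_{-\infty}^{\infty} f(x - \theta) \left[ \frac{\partial}{\partial x} \ln f(x - \theta) \right]^2 dx = \int_{-\infty}^{\infty} f(x) \left[ \frac{\partial}{\partial x} \ln f(x) \right]^2 dx$, (17.68)

which we can rewrite as

$J(X) = \int_{-\infty}^{\infty} f(x) \left[ \frac{\frac{\partial}{\partial x} f(x)}{f(x)} \right]^2 dx$. (17.69)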

We will call this the Fisher information of the distribution of X. Notice 
that like entropy, it is a function of the density. 

The importance of Fisher information is illustrated in the following 
theorem. 
Theorem 17.7.1 (Theorem 11.10.1: Cramér-Rao inequality) The
mean-squared error of any unbiased estimator T(X) of the parameter $\theta$ is
lower bounded by the reciprocal of the Fisher information:

$\mathrm{var}(T) \ge \frac{1}{J(\theta)}$. (17.70)

We now prove a fundamental relationship between the differential 
entropy and the Fisher information: 

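Theorem 17.7.2 (de Bruijn's identity: entropy and Fisher information)
Let X be any random variable with a finite variance with a density f(x).
Let Z be an independent normally distributed random variable with zero
mean and unit variance. Then

$\frac{\partial}{\partial t} h_e(X + \sqrt{t} Z) = \frac{1}{2} J(X + \sqrt{t} Z)$, (17.71)

where $h_e$ is the differential entropy to base e. In particular, if the limit
exists as $t \to 0$,

$\left. \frac{\partial}{\partial t} h_e(X + \sqrt{t} Z) \right|_{t=0} = \frac{1}{2} J(X)$. (17.72)

Proof: Let $Y_t = X + \sqrt{t} Z$. Then the density of $Y_t$ is

$g_t(y) = \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi t}} e^{-\frac{(y - x)^2}{2t}} dx$. (17.73)

Then

$\frac{\partial}{\partial t} g_t(y) = \int_{-\infty}^{\infty} f(x) \frac{\partial}{\partial t} \left[ \frac{1}{\sqrt{2\pi t}} e^{-\frac{(y - x)^2}{2t}} \right] dx$ (17.74)

$= \int_{-\infty}^{\infty} f(x) \left[ -\frac{1}{2t} \frac{1}{\sqrt{2\pi t}} e^{-\frac{(y - x)^2}{2t}} + \frac{(y - x)^2}{2t^2} \frac{1}{\sqrt{2\pi t}} e^{-\frac{(y - x)^2}{2t}} \right] dx$. (17.75)

We also calculate

$\frac{\partial}{\partial y} g_t(y) = \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi t}} \frac{\partial}{\partial y} \left[ e^{-\frac{(y - x)^2}{2t}} \right] dx$ (17.76)

$= \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi t}} \left[ -\frac{y - x}{t} e^{-\frac{(y - x)^2}{2t}} \right] dx$ (17.77)

and

$\frac{\partial^2}{\partial y^2} g_t(y) = \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi t}} \frac{\partial}{\partial y} \left[ -\frac{y - x}{t} e^{-\frac{(y - x)^2}{2t}} \right] dx$ (17.78)

$= \int_{-\infty}^{\infty} f(x) \frac{1}{\sqrt{2\pi t}} \left[ -\frac{1}{t} e^{-\frac{(y - x)^2}{2t}} + \frac{(y - x)^2}{t^2} e^{-\frac{(y - x)^2}{2t}} \right] dx$. (17.79)

Thus,

$\frac{\partial}{\partial t} g_t(y) = \frac{1}{2} \frac{\partial^2}{\partial y^2} g_t(y)$. (17.80)

We will use this relationship to calculate the derivative of the entropy of
$Y_t$, where the entropy is given by

$h_e(Y_t) = -\int_{-\infty}^{\infty} g_t(y) \ln g_t(y)\, dy$. (17.81)

Differentiating, we obtain

$\frac{\partial}{\partial t} h_e(Y_t) = -\int_{-\infty}^{\infty} \frac{\partial}{\partial t} g_t(y)\, dy - \int_{-\infty}^{\infty} \frac{\partial}{\partial t} g_t(y) \ln g_t(y)\, dy$ (17.82)

$= -\frac{\partial}{\partial t} \int_{-\infty}^{\infty} g_t(y)\, dy - \frac{1}{2} \int_{-\infty}^{\infty} \frac{\partial^2}{\partial y^2} g_t(y) \ln g_t(y)\, dy$. (17.83)

The first term is zero since $\int g_t(y)\, dy = 1$. The second term can be
integrated by parts to obtain

$\frac{\partial}{\partial t} h_e(Y_t) = -\frac{1}{2} \left[ \frac{\partial g_t(y)}{\partial y} \ln g_t(y) \right]_{-\infty}^{\infty} + \frac{1}{2} \int_{-\infty}^{\infty} \left[ \frac{\partial}{\partial y} g_t(y) \right]^2 \frac{1}{g_t(y)}\, dy$. (17.84)

The second term in (17.84) is $\frac{1}{2} J(Y_t)$. So the proof will be complete if
we show that the first term in (17.84) is zero. We can rewrite the first
term as

$\frac{\partial g_t(y)}{\partial y} \ln g_t(y) = \left[ \frac{\frac{\partial g_t(y)}{\partial y}}{\sqrt{g_t(y)}} \right] \left[ 2 \sqrt{g_t(y)} \ln \sqrt{g_t(y)} \right]$. (17.85)

The square of the first factor integrates to the Fisher information and
hence must be bounded as $y \to \pm\infty$. The second factor goes to zero since
$x \ln x \to 0$ as $x \to 0$ and $g_t(y) \to 0$ as $y \to \pm\infty$. Hence, the first term in
(17.84) goes to 0 at both limits and the theorem is proved. In the proof, we
have exchanged integration and differentiation in (17.74), (17.76), (17.78),
and (17.82). Strict justification of these exchanges requires the application
of the bounded convergence and mean value theorems; the details may
be found in Barron [30]. □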
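De Bruijn's identity is easy to check when X is itself Gaussian, because both sides are then available in closed form. In the sketch below, X ~ N(0, σ²) with σ² = 1.5 is an assumption made for illustration, so that X + √t Z ~ N(0, σ² + t); the derivative on the left is approximated by a central difference.

```python
import math

sigma2 = 1.5   # hypothetical variance of X ~ N(0, sigma2)

def h_e(t):
    """h_e(X + sqrt(t) Z) in nats, using X + sqrt(t) Z ~ N(0, sigma2 + t)."""
    return 0.5 * math.log(2 * math.pi * math.e * (sigma2 + t))

def fisher(t):
    """J(X + sqrt(t) Z): for a Gaussian density this is 1 / variance."""
    return 1.0 / (sigma2 + t)

t, dt = 0.7, 1e-6
lhs = (h_e(t + dt) - h_e(t - dt)) / (2 * dt)   # d/dt of the entropy
rhs = 0.5 * fisher(t)                          # right side of (17.71)
print(lhs, rhs)
```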

This theorem can be used to prove the entropy power inequality, which 
gives a lower bound on the entropy of a sum of independent random 
variables. 

Theorem 17.7.3 (Entropy power inequality) If X and Y are
independent random n-vectors with densities, then

$2^{\frac{2}{n} h(X + Y)} \ge 2^{\frac{2}{n} h(X)} + 2^{\frac{2}{n} h(Y)}$. (17.86)
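For a concrete instance of (17.86) with n = 1, one can take X and Y independent and uniform on [0, 1], so that X + Y has the triangular density on [0, 2]. This example and the midpoint-rule helper `entropy_bits` are assumptions introduced for this sketch.

```python
import math

def entropy_bits(f, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of f log2 f over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = f(lo + (i + 0.5) * dx)
        if fx > 0:
            total -= fx * math.log2(fx) * dx
    return total

def uniform(x):
    """Density of X (and of Y) on [0, 1]."""
    return 1.0

def triangle(s):
    """Density of X + Y on [0, 2]: the convolution of two uniforms."""
    return s if s <= 1 else 2 - s

h_x = entropy_bits(uniform, 0.0, 1.0)
h_sum = entropy_bits(triangle, 0.0, 2.0)

lhs = 2 ** (2 * h_sum)                     # entropy power of the sum
rhs = 2 ** (2 * h_x) + 2 ** (2 * h_x)      # sum of the entropy powers
print(lhs, rhs)
```

Here the left side should come out close to e, against 2 on the right.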


We outline the basic steps in the proof due to Stam [505] and Blachman
[61]. A different proof is given in Section 17.8.

Stam's proof of the entropy power inequality is based on a perturbation
argument. Let n = 1. Let $X_t = X + \sqrt{f(t)} Z_1$, $Y_t = Y + \sqrt{g(t)} Z_2$, where
$Z_1$ and $Z_2$ are independent $\mathcal{N}(0, 1)$ random variables. Then the entropy
power inequality for n = 1 reduces to showing that $s(0) \le 1$, where we
define

$s(t) = \frac{2^{2h(X_t)} + 2^{2h(Y_t)}}{2^{2h(X_t + Y_t)}}$. (17.87)

If $f(t) \to \infty$ and $g(t) \to \infty$ as $t \to \infty$, it is easy to show that $s(\infty) = 1$.
If, in addition, $s'(t) \ge 0$ for $t \ge 0$, this implies that $s(0) \le 1$. The proof
of the fact that $s'(t) \ge 0$ involves a clever choice of the functions f(t)
and g(t), an application of Theorem 17.7.2 and the use of a convolution
inequality for Fisher information,

$\frac{1}{J(X + Y)} \ge \frac{1}{J(X)} + \frac{1}{J(Y)}$. (17.88)

The entropy power inequality can be extended to the vector case by
induction. The details may be found in the papers by Stam [505] and
Blachman [61].


17.8 ENTROPY POWER INEQUALITY AND 
BRUNN-MINKOWSKI INEQUALITY 

The entropy power inequality provides a lower bound on the differential
entropy of a sum of two independent random vectors in terms of their
individual differential entropies. In this section we restate and outline an
alternative proof of the entropy power inequality. We also show how the
entropy power inequality and the Brunn-Minkowski inequality are related
by means of a common proof.

We can rewrite the entropy power inequality for dimension n = 1 in
a form that emphasizes its relationship to the normal distribution. Let
X and Y be two independent random variables with densities, and let
X' and Y' be independent normals with the same entropy as X and
Y, respectively. Then $2^{2h(X)} = 2^{2h(X')} = (2\pi e)\sigma_{X'}^2$ and similarly, $2^{2h(Y)} =
(2\pi e)\sigma_{Y'}^2$. Hence the entropy power inequality can be rewritten as

$2^{2h(X + Y)} \ge (2\pi e)(\sigma_{X'}^2 + \sigma_{Y'}^2) = 2^{2h(X' + Y')}$, (17.89)

since X' and Y' are independent. Thus, we have a new statement of the
entropy power inequality.

Theorem 17.8.1 (Restatement of the entropy power inequality) For
two independent random variables X and Y,

$h(X + Y) \ge h(X' + Y')$, (17.90)

where X' and Y' are independent normal random variables with h(X') =
h(X) and h(Y') = h(Y).


This form of the entropy power inequality bears a striking resemblance 
to the Brunn-Minkowski inequality, which bounds the volume of set 
sums. 


Definition The set sum A + B of two sets $A, B \subset \mathcal{R}^n$ is defined as the
set $\{x + y : x \in A, y \in B\}$.

Example 17.8.1 The set sum of two spheres of radius 1 is a sphere of
radius 2.

Theorem 17.8.2 (Brunn-Minkowski inequality) The volume of the set
sum of two sets A and B is greater than the volume of the set sum of two
spheres A' and B' with the same volume as A and B, respectively:

$V(A + B) \ge V(A' + B')$, (17.91)

where A' and B' are spheres with V(A') = V(A) and V(B') = V(B).

The similarity between the two theorems was pointed out in [104]. 
A common proof was found by Dembo [162] and Lieb, starting from a
strengthened version of Young's inequality. The same proof can be used to
prove a range of inequalities which includes the entropy power inequality 
and the Brunn-Minkowski inequality as special cases. We begin with a 
few definitions. 

Definition Let f and g be two densities over $\mathcal{R}^n$ and let $f * g$ denote
the convolution of the two densities. Let the $\mathcal{L}_r$ norm of the density be
defined by

\[ \|f\|_r = \left( \int f^r(x)\,dx \right)^{1/r}. \tag{17.92} \]


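Lemma 17.8.1 (Strengthened Young's inequality) For any two densities
f and g over $\mathcal{R}^n$,

\[ \|f * g\|_r \le \left( \frac{C_p C_q}{C_r} \right)^{n/2} \|f\|_p \|g\|_q, \tag{17.93} \]

where

\[ \frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1 \tag{17.94} \]

and

\[ C_p = \frac{p^{1/p}}{p'^{1/p'}}, \qquad \frac{1}{p} + \frac{1}{p'} = 1. \tag{17.95} \]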


Proof: The proof of this inequality may be found in [38] and [73]. □ 

We define a generalization of the entropy. 

Definition The Rényi entropy $h_r(X)$ of order r is defined as

\[ h_r(X) = \frac{1}{1-r} \log \left[ \int f^r(x)\,dx \right] \tag{17.96} \]

for $0 < r < \infty$, $r \ne 1$. If we take the limit as $r \to 1$, we obtain the Shannon
entropy function,

\[ h(X) = h_1(X) = -\int f(x) \log f(x)\,dx. \tag{17.97} \]


If we take the limit as $r \to 0$, we obtain the logarithm of the volume of
the support set,

\[ h_0(X) = \log \left( \mu\{x : f(x) > 0\} \right). \tag{17.98} \]

Thus, the zeroth-order Rényi entropy gives the logarithm of the measure
of the support set of the density f, and the Shannon entropy $h_1$ gives the
logarithm of the size of the "effective" support set (Theorem 8.2.2). We
now define the equivalent of the entropy power for Rényi entropies.

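Definition The Rényi entropy power $V_r(X)$ of order r is defined as

\[
V_r(X) =
\begin{cases}
\left[ \int f^r(x)\,dx \right]^{-\frac{2}{n}\frac{r'}{r}}, & 0 < r < \infty,\ r \ne 1,\ \frac{1}{r} + \frac{1}{r'} = 1, \\
\exp\left[ \frac{2}{n} h(X) \right], & r = 1, \\
\mu(\{x : f(x) > 0\})^{2/n}, & r = 0.
\end{cases}
\tag{17.99}
\]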


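Theorem 17.8.3 For two independent random variables X and Y and
any $0 < r < \infty$ and any $0 \le \lambda \le 1$, we have

\begin{align*}
\log V_r(X+Y) \ge{} & \lambda \log V_p(X) + (1-\lambda) \log V_q(Y) + H(\lambda) \\
& + \frac{1+r}{1-r} \left[ H\!\left( \frac{r+\lambda(1-r)}{1+r} \right) - H\!\left( \frac{r}{1+r} \right) \right], \tag{17.100}
\end{align*}

where $p = \frac{r}{r+\lambda(1-r)}$, $q = \frac{r}{r+(1-\lambda)(1-r)}$, and
$H(\lambda) = -\lambda \log \lambda - (1-\lambda) \log(1-\lambda)$.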

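Proof: If we take the logarithm of Young's inequality (17.93), we obtain

\[ \frac{1}{r'} \log V_r(X+Y) \ge \frac{1}{p'} \log V_p(X) + \frac{1}{q'} \log V_q(Y) + \log C_r - \log C_p - \log C_q. \tag{17.101} \]

Setting $\lambda = r'/p'$ and using (17.94), we have $1 - \lambda = r'/q'$, $p = \frac{r}{r+\lambda(1-r)}$,
and $q = \frac{r}{r+(1-\lambda)(1-r)}$. Thus, (17.101) becomes

\begin{align*}
\log V_r(X+Y) \ge{} & \lambda \log V_p(X) + (1-\lambda)\log V_q(Y) + \frac{r'}{r}\log r - \log r' \\
& - \frac{r'}{p}\log p + \frac{r'}{p'}\log p' - \frac{r'}{q}\log q + \frac{r'}{q'}\log q' \tag{17.102} \\
={} & \lambda \log V_p(X) + (1-\lambda)\log V_q(Y) + \frac{r'}{r}\log r - (\lambda + 1 - \lambda)\log r' \\
& - \frac{r'}{p}\log p + \lambda\log p' - \frac{r'}{q}\log q + (1-\lambda)\log q' \tag{17.103} \\
={} & \lambda \log V_p(X) + (1-\lambda)\log V_q(Y) + \frac{1}{r-1}\log r + H(\lambda) \\
& - \frac{r+\lambda(1-r)}{r-1}\log\frac{r}{r+\lambda(1-r)} - \frac{r+(1-\lambda)(1-r)}{r-1}\log\frac{r}{r+(1-\lambda)(1-r)} \tag{17.104} \\
={} & \lambda \log V_p(X) + (1-\lambda)\log V_q(Y) + H(\lambda) \\
& + \frac{1+r}{1-r}\left[ H\!\left(\frac{r+\lambda(1-r)}{1+r}\right) - H\!\left(\frac{r}{1+r}\right) \right], \tag{17.105}
\end{align*}

where the details of the algebra for the last step are omitted. □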
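
The step from (17.104) to (17.105) is omitted in the text. One way to fill it in is sketched below; the shorthand a and b is introduced here only for this check and does not appear in the original.

```latex
% Shorthand (not in the original text):
%   a = r + \lambda(1-r), \qquad b = r + (1-\lambda)(1-r), \qquad a + b = 1 + r.
\begin{align*}
\text{[terms of (17.104) other than } H(\lambda)\text{ and the } \log V \text{ terms]}
  &= \frac{1}{r-1}\log r - \frac{a}{r-1}\log\frac{r}{a} - \frac{b}{r-1}\log\frac{r}{b} \\
  &= \frac{1}{r-1}\bigl[(1 - a - b)\log r + a\log a + b\log b\bigr] \\
  &= \frac{1}{r-1}\bigl[a\log a + b\log b - r\log r\bigr]. \\[6pt]
H\!\left(\frac{a}{1+r}\right)
  &= \log(1+r) - \frac{a\log a + b\log b}{1+r}, \\
H\!\left(\frac{r}{1+r}\right)
  &= \log(1+r) - \frac{r\log r}{1+r}, \\
\frac{1+r}{1-r}\left[ H\!\left(\frac{a}{1+r}\right) - H\!\left(\frac{r}{1+r}\right) \right]
  &= \frac{1}{r-1}\bigl[a\log a + b\log b - r\log r\bigr].
\end{align*}
```

The first and last displays agree, which is the claimed equality of (17.104) and (17.105).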

The Brunn-Minkowski inequality and the entropy power inequality 
can then be obtained as special cases of this theorem. 

• The entropy power inequality. Taking the limit of (17.100) as $r \to 1$
and setting

\[ \lambda = \frac{V_1(X)}{V_1(X) + V_1(Y)}, \tag{17.106} \]

we obtain

\[ V_1(X+Y) \ge V_1(X) + V_1(Y), \tag{17.107} \]

which is the entropy power inequality.

• The Brunn-Minkowski inequality. Similarly, letting $r \to 0$ and choosing

\[ \lambda = \frac{\sqrt{V_0(X)}}{\sqrt{V_0(X)} + \sqrt{V_0(Y)}}, \tag{17.108} \]

we obtain

\[ \sqrt{V_0(X+Y)} \ge \sqrt{V_0(X)} + \sqrt{V_0(Y)}. \tag{17.109} \]

Now let A be the support set of X and B be the support set of Y.
Then A + B is the support set of X + Y, and (17.109) reduces to

\[ [\mu(A+B)]^{1/n} \ge [\mu(A)]^{1/n} + [\mu(B)]^{1/n}, \tag{17.110} \]

which is the Brunn-Minkowski inequality.

The general theorem unifies the entropy power inequality and the
Brunn-Minkowski inequality and introduces a continuum of new 
inequalities that lie between the entropy power inequality and the 
Brunn-Minkowski inequality. This further strengthens the analogy 
between entropy power and volume. 
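
The choice of λ in the entropy power case can be verified by a short calculation. The sketch below writes a = V_1(X) and b = V_1(Y) (shorthand introduced here) and uses the r → 1 limit of (17.100), in which p and q tend to 1 and the last term of (17.100) tends to zero.

```latex
\begin{align*}
\log V_1(X+Y)
  &\ge \lambda \log a + (1-\lambda)\log b + H(\lambda) \\
  &= \lambda \log\frac{a}{\lambda} + (1-\lambda)\log\frac{b}{1-\lambda} \\
  &= \lambda \log(a+b) + (1-\lambda)\log(a+b)
     \qquad \text{for } \lambda = \frac{a}{a+b} \\
  &= \log(a+b).
\end{align*}
```

The Brunn-Minkowski case works the same way: as r → 0 the last term of (17.100) tends to H(λ), so after dividing by 2 the same computation applies with $a = \sqrt{V_0(X)}$ and $b = \sqrt{V_0(Y)}$.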


17.9 INEQUALITIES FOR DETERMINANTS 

Throughout the remainder of this chapter, we assume that K is a nonnegative
definite symmetric n x n matrix. Let |K| denote the determinant of
K.

We first give an information-theoretic proof of a result due to Ky Fan
[199].


Theorem 17.9.1 log |K| is concave.

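Proof: Let $X_1$ and $X_2$ be normally distributed n-vectors, $X_i \sim \mathcal{N}(0, K_i)$,
i = 1, 2. Let the random variable $\theta$ have the distribution

\[ \Pr\{\theta = 1\} = \lambda, \tag{17.111} \]
\[ \Pr\{\theta = 2\} = 1 - \lambda \tag{17.112} \]

for some $0 \le \lambda \le 1$. Let $\theta$, $X_1$, and $X_2$ be independent, and let $Z = X_\theta$.
Then Z has covariance $K_Z = \lambda K_1 + (1-\lambda) K_2$. However, Z will
not be multivariate normal. By first using Theorem 17.2.3, followed by
Theorem 17.2.1, we have

\begin{align*}
\frac{1}{2} \log (2\pi e)^n |\lambda K_1 + (1-\lambda) K_2| &\ge h(Z) \tag{17.113} \\
&\ge h(Z \mid \theta) \tag{17.114} \\
&= \lambda \frac{1}{2} \log (2\pi e)^n |K_1| + (1-\lambda) \frac{1}{2} \log (2\pi e)^n |K_2|.
\end{align*}

Thus,

\[ |\lambda K_1 + (1-\lambda) K_2| \ge |K_1|^{\lambda} |K_2|^{1-\lambda}, \tag{17.115} \]

as desired. □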
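
Inequality (17.115) is easy to spot-check numerically. The following Python sketch uses two positive definite 3 x 3 matrices picked arbitrarily for illustration (they are not from the text) and a small cofactor-expansion determinant, which is adequate only for small n.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def mix(k1, k2, lam):
    """Entrywise lam * K1 + (1 - lam) * K2."""
    return [[lam * x + (1.0 - lam) * y for x, y in zip(r1, r2)]
            for r1, r2 in zip(k1, k2)]

def concavity_gap(k1, k2, lam):
    """log|lam K1 + (1-lam) K2| - lam log|K1| - (1-lam) log|K2|."""
    lhs = math.log(det(mix(k1, k2, lam)))
    rhs = lam * math.log(det(k1)) + (1.0 - lam) * math.log(det(k2))
    return lhs - rhs

# Illustrative positive definite matrices (assumed for this check only).
K1 = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]
K2 = [[1.0, -0.4, 0.2], [-0.4, 3.0, 0.0], [0.2, 0.0, 0.5]]

gaps = [concavity_gap(K1, K2, i / 10) for i in range(11)]
print(min(gaps), max(gaps))
```

The gap is zero at λ = 0 and λ = 1 and positive in between, as concavity of log |K| requires.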


We now give Hadamard's inequality using an information-theoretic
proof [128].

Theorem 17.9.2 (Hadamard) $|K| \le \prod_i K_{ii}$, with equality iff $K_{ij} = 0$,
$i \ne j$.

Proof: Let $X \sim \mathcal{N}(0, K)$. Then

\[ \frac{1}{2} \log (2\pi e)^n |K| = h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_i) = \sum_{i=1}^n \frac{1}{2} \log 2\pi e |K_{ii}|, \tag{17.116} \]

with equality iff $X_1, X_2, \ldots, X_n$ are independent (i.e., $K_{ij} = 0$, $i \ne j$). □

We now prove a generalization of Hadamard's inequality due to Szasz
[391]. Let $K(i_1, i_2, \ldots, i_k)$ be the k x k principal submatrix of K formed
by the rows and columns with indices $i_1, i_2, \ldots, i_k$.

Theorem 17.9.3 (Szasz) If K is a positive definite n x n matrix and
$P_k$ denotes the product of the determinants of all the principal k-rowed
minors of K, that is,

\[ P_k = \prod_{1 \le i_1 < i_2 < \cdots < i_k \le n} |K(i_1, i_2, \ldots, i_k)|, \tag{17.117} \]

then

\[ P_1 \ge P_2^{1/\binom{n-1}{1}} \ge P_3^{1/\binom{n-1}{2}} \ge \cdots \ge P_n. \tag{17.118} \]


Proof: Let $X \sim \mathcal{N}(0, K)$. Then the theorem follows directly from
Theorem 17.6.1, with the identification

\[ h_k^{(n)} = \frac{1}{2n\binom{n-1}{k-1}} \log P_k + \frac{1}{2} \log 2\pi e. \]  □


We can also prove a related theorem. 


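Theorem 17.9.4 Let K be a positive definite n x n matrix and let

\[ S_k^{(n)} = \frac{1}{\binom{n}{k}} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} |K(i_1, i_2, \ldots, i_k)|^{1/k}. \tag{17.119} \]

Then

\[ \frac{1}{n} \operatorname{tr}(K) = S_1^{(n)} \ge S_2^{(n)} \ge \cdots \ge S_n^{(n)} = |K|^{1/n}. \tag{17.120} \]


Proof: This follows directly from the corollary to Theorem 17.6.1, with
the identification $t_k^{(n)} = (2\pi e) S_k^{(n)}$ and r = 2. □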
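
The chain (17.120) can be illustrated with a small computation. The sketch below evaluates $S_k^{(n)}$ for one positive definite 3 x 3 matrix chosen arbitrarily for this example; the helper names are hypothetical.

```python
import math
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def subset_average(k, size):
    """Average of |K(S)|^(1/size) over all index subsets S of the given size."""
    n = len(k)
    vals = [det([[k[i][j] for j in s] for i in s]) ** (1.0 / size)
            for s in combinations(range(n), size)]
    return sum(vals) / math.comb(n, size)

# Illustrative positive definite matrix (assumed for this check only).
K = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]

# S[0] is tr(K)/n, S[-1] is |K|^(1/n), and the sequence should not increase.
S = [subset_average(K, size) for size in range(1, len(K) + 1)]
print(S)
```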





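Theorem 17.9.5 Let

\[ Q_k = \left( \prod_{S : |S| = k} \frac{|K|}{|K(S^c)|} \right)^{1 / \left( k \binom{n}{k} \right)}. \tag{17.121} \]

Then

\[ \left( \prod_{i=1}^n \sigma_i^2 \right)^{1/n} = Q_1 \le Q_2 \le \cdots \le Q_{n-1} \le Q_n = |K|^{1/n}. \tag{17.122} \]


Proof: The theorem follows immediately from Theorem 17.6.3 and the
identification

\[ h(X(S) \mid X(S^c)) = \frac{1}{2} \log (2\pi e)^k \frac{|K|}{|K(S^c)|}. \tag{17.123} \]  □


The outermost inequality, $Q_1 \le Q_n$, can be rewritten as

\[ |K| \ge \prod_{i=1}^n \sigma_i^2, \tag{17.124} \]

where

\[ \sigma_i^2 = \frac{|K|}{|K(1, 2, \ldots, i-1, i+1, \ldots, n)|} \tag{17.125} \]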


is the minimum mean-squared error in the linear prediction of $X_i$ from
the remaining X's. Thus, $\sigma_i^2$ is the conditional variance of $X_i$ given the
remaining $X_j$'s if $X_1, X_2, \ldots, X_n$ are jointly normal. Combining this with
Hadamard's inequality gives upper and lower bounds on the determinant
of a positive definite matrix.

Corollary

\[ \prod_i K_{ii} \ge |K| \ge \prod_i \sigma_i^2. \tag{17.126} \]


Hence, the determinant of a covariance matrix lies between the product
of the unconditional variances $K_{ii}$ of the random variables $X_i$ and the
product of the conditional variances $\sigma_i^2$.
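
The two-sided bound in the corollary can be checked directly from the determinant formula for the conditional variances. The following sketch does so for one positive definite matrix chosen here for illustration; `conditional_variances` is a hypothetical helper name.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def conditional_variances(k):
    """sigma_i^2 = |K| / |K with row and column i removed|, for each i."""
    n = len(k)
    full = det(k)
    out = []
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        out.append(full / det([[k[r][c] for c in keep] for r in keep]))
    return out

# Illustrative positive definite matrix (assumed for this check only).
K = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]

upper = math.prod(K[i][i] for i in range(len(K)))   # product of variances
lower = math.prod(conditional_variances(K))         # product of conditional variances
print(lower, det(K), upper)
```

The three printed numbers should appear in increasing order, with the determinant in the middle.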

We now prove a property of Toeplitz matrices, which are important as
the covariance matrices of stationary random processes. A Toeplitz matrix
K is characterized by the property that $K_{ij} = K_{rs}$ if $|i - j| = |r - s|$.
Let $K_k$ denote the principal minor $K(1, 2, \ldots, k)$. For such a matrix, the
following property can be proved easily from the properties of the entropy
function.

Theorem 17.9.6 If the positive definite n x n matrix K is Toeplitz, then

\[ |K_1| \ge |K_2|^{1/2} \ge \cdots \ge |K_{n-1}|^{1/(n-1)} \ge |K_n|^{1/n} \tag{17.127} \]

and $|K_k|/|K_{k-1}|$ is decreasing in k, and

\[ \lim_{n \to \infty} |K_n|^{1/n} = \lim_{n \to \infty} \frac{|K_n|}{|K_{n-1}|}. \tag{17.128} \]

Proof: Let $(X_1, X_2, \ldots, X_n) \sim \mathcal{N}(0, K_n)$. We observe that

\begin{align*}
h(X_k \mid X_{k-1}, \ldots, X_1) &= h(X^k) - h(X^{k-1}) \tag{17.129} \\
&= \frac{1}{2} \log (2\pi e) \frac{|K_k|}{|K_{k-1}|}. \tag{17.130}
\end{align*}

Thus, the monotonicity of $|K_k|/|K_{k-1}|$ follows from the monotonicity of
$h(X_k \mid X_{k-1}, \ldots, X_1)$, which follows from

\begin{align*}
h(X_k \mid X_{k-1}, \ldots, X_1) &= h(X_{k+1} \mid X_k, \ldots, X_2) \tag{17.131} \\
&\ge h(X_{k+1} \mid X_k, \ldots, X_2, X_1), \tag{17.132}
\end{align*}

where the equality follows from the Toeplitz assumption and the inequality
from the fact that conditioning reduces entropy. Since $h(X_k \mid X_{k-1}, \ldots, X_1)$
is decreasing, it follows that the running averages

\[ \frac{1}{k} h(X_1, \ldots, X_k) = \frac{1}{k} \sum_{i=1}^k h(X_i \mid X_{i-1}, \ldots, X_1) \tag{17.133} \]

are decreasing in k. Then (17.127) follows from $h(X_1, X_2, \ldots, X_k) =
\frac{1}{2} \log (2\pi e)^k |K_k|$. □

Finally, since $h(X_n \mid X_{n-1}, \ldots, X_1)$ is a decreasing sequence, it has a
limit. Hence by the theorem of the Cesàro mean,

\begin{align*}
\lim_{n \to \infty} \frac{h(X_1, X_2, \ldots, X_n)}{n} &= \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n h(X_k \mid X_{k-1}, \ldots, X_1) \\
&= \lim_{n \to \infty} h(X_n \mid X_{n-1}, \ldots, X_1). \tag{17.134}
\end{align*}

Translating this to determinants, one obtains

\[ \lim_{n \to \infty} |K_n|^{1/n} = \lim_{n \to \infty} \frac{|K_n|}{|K_{n-1}|}. \tag{17.135} \]
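
To make Theorem 17.9.6 concrete, the sketch below uses the Toeplitz covariance $K_{ij} = \rho^{|i-j|}$ of a stationary first-order autoregressive process. This example and the value ρ = 0.6 are assumptions made here, not taken from the text. For this family $|K_k| = (1 - \rho^2)^{k-1}$, so both limits in (17.135) equal $1 - \rho^2$.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def toeplitz(first_row):
    """Symmetric Toeplitz matrix with K[i][j] = first_row[|i - j|]."""
    n = len(first_row)
    return [[first_row[abs(i - j)] for j in range(n)] for i in range(n)]

rho = 0.6   # illustrative correlation parameter
n = 6
K = toeplitz([rho ** d for d in range(n)])

# dets[k] = |K_k| for the leading k x k block, with |K_0| = 1 by convention.
dets = [det([row[:k] for row in K[:k]]) for k in range(n + 1)]
roots = [dets[k] ** (1.0 / k) for k in range(1, n + 1)]      # |K_k|^(1/k)
ratios = [dets[k] / dets[k - 1] for k in range(1, n + 1)]    # |K_k| / |K_{k-1}|

print(roots)
print(ratios)
```

Both sequences are nonincreasing; the ratios reach $1 - \rho^2$ from k = 2 onward, while the kth roots approach the same value more slowly from above.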


Theorem 17.9.7 (Minkowski inequality [390])

\[ |K_1 + K_2|^{1/n} \ge |K_1|^{1/n} + |K_2|^{1/n}. \tag{17.136} \]


Proof: Let $X_1$, $X_2$ be independent with $X_i \sim \mathcal{N}(0, K_i)$. Noting that
$X_1 + X_2 \sim \mathcal{N}(0, K_1 + K_2)$ and using the entropy power inequality (Theorem
17.7.3) yields

\begin{align*}
(2\pi e) |K_1 + K_2|^{1/n} &= 2^{\frac{2}{n} h(X_1 + X_2)} \tag{17.137} \\
&\ge 2^{\frac{2}{n} h(X_1)} + 2^{\frac{2}{n} h(X_2)} \tag{17.138} \\
&= (2\pi e) |K_1|^{1/n} + (2\pi e) |K_2|^{1/n}. \tag{17.139}
\end{align*}
□
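
A numerical spot check of (17.136) is sketched below for two positive definite matrices chosen arbitrarily for illustration. The second call uses a scaled copy of the first matrix, for which both sides of (17.136) coincide.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def minkowski_gap(k1, k2):
    """|K1 + K2|^(1/n) - |K1|^(1/n) - |K2|^(1/n)."""
    n = len(k1)
    total = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(k1, k2)]
    return det(total) ** (1.0 / n) - det(k1) ** (1.0 / n) - det(k2) ** (1.0 / n)

# Illustrative positive definite matrices (assumed for this check only).
K1 = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]
K2 = [[1.0, -0.4, 0.2], [-0.4, 3.0, 0.0], [0.2, 0.0, 0.5]]
K3 = [[2.0 * x for x in row] for row in K1]   # a scaled copy of K1

print(minkowski_gap(K1, K2))   # positive
print(minkowski_gap(K1, K3))   # zero up to rounding
```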
17.10 INEQUALITIES FOR RATIOS OF DETERMINANTS

We now prove similar inequalities for ratios of determinants. Before
developing the next theorem, we make an observation about minimum
mean-squared-error linear prediction. If $(X_1, X_2, \ldots, X_n) \sim \mathcal{N}(0, K_n)$, we know
that the conditional density of $X_n$ given $(X_1, X_2, \ldots, X_{n-1})$ is univariate
normal with mean linear in $X_1, X_2, \ldots, X_{n-1}$ and conditional variance
$\sigma_n^2$. Here $\sigma_n^2$ is the minimum mean-squared error $E(X_n - \hat{X}_n)^2$ over all
linear estimators $\hat{X}_n$ based on $X_1, X_2, \ldots, X_{n-1}$.

Lemma 17.10.1 $\sigma_n^2 = |K_n|/|K_{n-1}|$.

Proof: Using the conditional normality of $X_n$, we have

\begin{align*}
\frac{1}{2} \log 2\pi e \sigma_n^2 &= h(X_n \mid X_1, X_2, \ldots, X_{n-1}) \tag{17.140} \\
&= h(X_1, X_2, \ldots, X_n) - h(X_1, X_2, \ldots, X_{n-1}) \tag{17.141} \\
&= \frac{1}{2} \log (2\pi e)^n |K_n| - \frac{1}{2} \log (2\pi e)^{n-1} |K_{n-1}| \tag{17.142} \\
&= \frac{1}{2} \log 2\pi e \frac{|K_n|}{|K_{n-1}|}. \tag{17.143}
\end{align*}
□

Minimization of $\sigma_n^2$ over a set of allowed covariance matrices $\{K_n\}$ is
aided by the following theorem. Such problems arise in maximum entropy
spectral density estimation.
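
Lemma 17.10.1 can be confirmed numerically by computing the minimum mean-squared error from the normal equations and comparing it with the determinant ratio. The sketch below does this for one illustrative covariance matrix (an assumption of this example), solving the normal equations by Cramer's rule, which is adequate only for small n.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if not m:
        return 1.0
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def mmse_last(k):
    """Minimum of E(X_n - a^T X_{1..n-1})^2 over a, via the normal equations."""
    n = len(k)
    past = [row[:n - 1] for row in k[:n - 1]]     # K_{n-1}
    cross = [k[i][n - 1] for i in range(n - 1)]   # Cov(X_i, X_n), i < n
    d = det(past)
    # Cramer's rule for the normal equations: past * a = cross.
    a = []
    for j in range(n - 1):
        replaced = [row[:j] + [cross[i]] + row[j + 1:]
                    for i, row in enumerate(past)]
        a.append(det(replaced) / d)
    # Residual variance: K_nn - a^T cross.
    return k[n - 1][n - 1] - sum(ai * ci for ai, ci in zip(a, cross))

# Illustrative positive definite covariance matrix (assumed for this check only).
K = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]

ratio = det(K) / det([row[:2] for row in K[:2]])   # |K_n| / |K_{n-1}|
print(mmse_last(K), ratio)
```

The two printed values agree up to rounding, as the lemma asserts.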

Theorem 17.10.1 (Bergstrøm [42]) $\log(|K_n|/|K_{n-p}|)$ is concave
in $K_n$.

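Proof: We remark that Theorem 17.9.1 cannot be used because
$\log(|K_n|/|K_{n-p}|)$ is the difference of two concave functions. Let $Z = X_\theta$,
where $X_1 \sim \mathcal{N}(0, S_n)$, $X_2 \sim \mathcal{N}(0, T_n)$, $\Pr\{\theta = 1\} = \lambda = 1 - \Pr\{\theta = 2\}$,
and let $X_1$, $X_2$, $\theta$ be independent. The covariance matrix $K_n$ of Z is
given by

\[ K_n = \lambda S_n + (1 - \lambda) T_n. \tag{17.144} \]

The following chain of inequalities proves the theorem:

\begin{align*}
& \lambda \frac{1}{2} \log (2\pi e)^p \frac{|S_n|}{|S_{n-p}|} + (1-\lambda) \frac{1}{2} \log (2\pi e)^p \frac{|T_n|}{|T_{n-p}|} \\
& \quad \overset{(a)}{=} \lambda h(X_{1,n}, X_{1,n-1}, \ldots, X_{1,n-p+1} \mid X_{1,1}, \ldots, X_{1,n-p}) \\
& \qquad + (1-\lambda) h(X_{2,n}, X_{2,n-1}, \ldots, X_{2,n-p+1} \mid X_{2,1}, \ldots, X_{2,n-p}) \tag{17.145} \\
& \quad = h(Z_n, Z_{n-1}, \ldots, Z_{n-p+1} \mid Z_1, \ldots, Z_{n-p}, \theta) \tag{17.146} \\
& \quad \overset{(b)}{\le} h(Z_n, Z_{n-1}, \ldots, Z_{n-p+1} \mid Z_1, \ldots, Z_{n-p}) \tag{17.147} \\
& \quad \overset{(c)}{\le} \frac{1}{2} \log (2\pi e)^p \frac{|K_n|}{|K_{n-p}|}, \tag{17.148}
\end{align*}

where (a) follows from $h(X_n, X_{n-1}, \ldots, X_{n-p+1} \mid X_1, \ldots, X_{n-p}) =
h(X_1, \ldots, X_n) - h(X_1, \ldots, X_{n-p})$, (b) follows from the conditioning
lemma, and (c) follows from a conditional version of Theorem 17.2.3. □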

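Theorem 17.10.2 (Bergstrøm [42]) $|K_n|/|K_{n-1}|$ is concave in $K_n$.

Proof: Again we use the properties of Gaussian random variables. Let
us assume that we have two independent Gaussian random n-vectors,
$X \sim \mathcal{N}(0, A_n)$ and $Y \sim \mathcal{N}(0, B_n)$. Let Z = X + Y. Then

\begin{align*}
\frac{1}{2} \log 2\pi e \frac{|A_n + B_n|}{|A_{n-1} + B_{n-1}|}
& \overset{(a)}{=} h(Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1) \tag{17.149} \\
& \overset{(b)}{\ge} h(Z_n \mid Z_{n-1}, \ldots, Z_1, X_{n-1}, \ldots, X_1, Y_{n-1}, \ldots, Y_1) \tag{17.150} \\
& \overset{(c)}{=} h(X_n + Y_n \mid X_{n-1}, \ldots, X_1, Y_{n-1}, \ldots, Y_1) \tag{17.151} \\
& \overset{(d)}{=} E \frac{1}{2} \log \left[ 2\pi e \operatorname{Var}(X_n + Y_n \mid X_{n-1}, \ldots, X_1, Y_{n-1}, \ldots, Y_1) \right] \tag{17.152} \\
& \overset{(e)}{=} E \frac{1}{2} \log \left[ 2\pi e \left( \operatorname{Var}(X_n \mid X_{n-1}, \ldots, X_1) + \operatorname{Var}(Y_n \mid Y_{n-1}, \ldots, Y_1) \right) \right] \tag{17.153} \\
& \overset{(f)}{=} E \frac{1}{2} \log \left( 2\pi e \left( \frac{|A_n|}{|A_{n-1}|} + \frac{|B_n|}{|B_{n-1}|} \right) \right) \tag{17.154} \\
& = \frac{1}{2} \log \left( 2\pi e \left( \frac{|A_n|}{|A_{n-1}|} + \frac{|B_n|}{|B_{n-1}|} \right) \right), \tag{17.155}
\end{align*}

where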

(a) follows from Lemma 17.10.1

(b) follows from the fact that conditioning decreases entropy

(c) follows from the fact that Z is a function of X and Y

(d) follows since $X_n + Y_n$ is Gaussian conditioned on $X_1, X_2, \ldots,
X_{n-1}, Y_1, Y_2, \ldots, Y_{n-1}$, and hence we can express its entropy in
terms of its variance

(e) follows from the independence of $X_n$ and $Y_n$ conditioned on the
past $X_1, X_2, \ldots, X_{n-1}, Y_1, Y_2, \ldots, Y_{n-1}$

(f) follows from the fact that for a set of jointly Gaussian random
variables, the conditional variance is constant, independent of the
conditioning variables (Lemma 17.10.1)


Setting $A = \lambda S$ and $B = \bar{\lambda} T$, where $\bar{\lambda} = 1 - \lambda$, we obtain

\[ \frac{|\lambda S_n + \bar{\lambda} T_n|}{|\lambda S_{n-1} + \bar{\lambda} T_{n-1}|} \ge \lambda \frac{|S_n|}{|S_{n-1}|} + \bar{\lambda} \frac{|T_n|}{|T_{n-1}|} \tag{17.156} \]

(i.e., $|K_n|/|K_{n-1}|$ is concave). Simple examples show that $|K_n|/|K_{n-p}|$
is not necessarily concave for $p \ge 2$. □


A number of other determinant inequalities can be proved by these 
techniques. A few of them are given as problems.

OVERALL SUMMARY

Entropy. $H(X) = -\sum p(x) \log p(x)$.

Relative entropy. $D(p\|q) = \sum p(x) \log \frac{p(x)}{q(x)}$.

Mutual information. $I(X; Y) = \sum p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$.

Information inequality. $D(p\|q) \ge 0$.

Asymptotic equipartition property. $-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to
H(X)$.

Data compression. $H(X) \le L^* < H(X) + 1$.

Kolmogorov complexity. $K(x) = \min_{\mathcal{U}(p) = x} l(p)$.

Universal probability. $\log \frac{1}{P_{\mathcal{U}}(x)} \approx K(x)$.

Channel capacity. $C = \max_{p(x)} I(X; Y)$.

Data transmission

• R < C: Asymptotically error-free communication possible

• R > C: Asymptotically error-free communication not possible

Gaussian channel capacity. $C = \frac{1}{2} \log\left(1 + \frac{P}{N}\right)$.

Rate distortion. $R(D) = \min I(X; \hat{X})$ over all $p(\hat{x}|x)$ such that
$E_{p(x)p(\hat{x}|x)} d(X, \hat{X}) \le D$.

Growth rate for investment. $W^* = \max_{\mathbf{b}^*} E \log \mathbf{b}^t \mathbf{X}$.

PROBLEMS 

17.1 Sum of positive definite matrices. For any two positive definite
matrices, $K_1$ and $K_2$, show that $|K_1 + K_2| \ge |K_1|$.

17.2 Fan's inequality [200] for ratios of determinants. For all $1 \le p \le n$,
for a positive definite $K = K(1, 2, \ldots, n)$, show that

\[ \frac{|K|}{|K(p+1, p+2, \ldots, n)|} \le \prod_{i=1}^{p} \frac{|K(i, p+1, p+2, \ldots, n)|}{|K(p+1, p+2, \ldots, n)|}. \tag{17.157} \]

17.3 Convexity of determinant ratios. For positive definite matrices K,
$K_0$, show that $\ln(|K + K_0|/|K|)$ is convex in K.

17.4 Data-processing inequality. Let random variables $X_1$, $X_2$, $X_3$, and
$X_4$ form a Markov chain $X_1 \to X_2 \to X_3 \to X_4$. Show that

\[ I(X_1; X_3) + I(X_2; X_4) \le I(X_1; X_4) + I(X_2; X_3). \tag{17.158} \]

17.5 Markov chains. Let random variables X, Y, Z, and W form a
Markov chain so that $X \to Y \to (Z, W)$ [i.e., $p(x, y, z, w) =
p(x)p(y|x)p(z, w|y)$]. Show that

\[ I(X; Z) + I(X; W) \le I(X; Y) + I(Z; W). \tag{17.159} \]


HISTORICAL NOTES 

The entropy power inequality was stated by Shannon [472]; the first formal
proofs are due to Stam [505] and Blachman [61]. The unified proof
of the entropy power and Brunn-Minkowski inequalities is in Dembo
et al. [164].

Most of the matrix inequalities in this chapter were derived using
information-theoretic methods by Cover and Thomas [118]. Some of the
subset inequalities for entropy rates may be found in Han [270].